INTRODUCTION

Human speech often contains sufficient information to identify
the speaker and the speaker's emotional state from its "sound".
Included in the actual speech signal are all the harmonics and
overtones that identify the speech as real (human). This extra
information is desirable in normal conversation, but it
constitutes a waste of communication capacity in the case of speech
transmission over long distances. The question as to what part
of the speech spectrum is not essential to its intelligibility
remains unanswered.

Speech  may  be  described  as  a  modulation  process  in  which 
at  least  two  modulated  carriers  contain  the  information.   The 
sound  of  the  vocal  cords,  which  represents  a  periodic  series  of 
pulses,  and  the  sound  of  exhaled  air,  which  represents  a  signal 
with  a  nearly  constant  spectrum,  constitute  the  two  carriers. 
The  sounds  of  the  vocal  cords  and  exhaled  air  will  be  referred  to 
as  "voiced"  and  "unvoiced"  respectively  (see  Fig.  1) .   The  voiced 
carrier  is  both  amplitude  and  frequency  modulated  in  order  to 
produce  loudness  and  pitch  changes.   Both  carriers  undergo 
frequency-noise  modulation  in  the  cavity  of  the  mouth,  and  thus 
generate  "formants".   The  frequency  and  amplitude  of  a  gross 
concentration  of  energy  in  the  spectrum  of  a  speech  sound  is 
defined  as  a  formant  (Flanagan,  1956) .   The  amplitude  modulation 
of  both  carriers  produces  consonants.   The  frequency  of  vibration 
of  the  vocal  cords  is  defined  as  the  pitch  frequency. 


The  concentration  of  energy  in  the  speech  spectrum  does 
not  change  rapidly  with  time.   An  inference   about  redundancy 
in  the  speech  sound  can  be  made  from  the  fact  that  the  ear  can 
identify  a  sound  on  receipt  of  only  a  portion  of  the  total  energy. 
Therefore  a  sampled  speech  signal  contains  essentially  all  the 
intelligibility  of  the  original  signal.   Since  a  speech  signal 
may  be  described  in  either  the  time  or  frequency  domain  by  the 
Fourier  integral  pair, 

$$ s(t) = \int_{-\infty}^{\infty} S(f)\, e^{j 2\pi f t}\, df \tag{1} $$

and

$$ S(f) = \int_{-\infty}^{\infty} s(t)\, e^{-j 2\pi f t}\, dt, \tag{2} $$

either  or  both  time  and  frequency  sampling  may  be  employed. 

Flanagan  (1956)  shows  that  knowledge  of  the  first  three 
formants  is  sufficient  to  specify  most  voiced  and  unvoiced 
English  vowels  and  consonants,  and  that  most  of  the  significant 
information  in  speech  is  contained  in  the  frequencies  below 
3,000  cps.   From  this  it  is  inferred  that  the  speech  spectrum 
is  not  efficiently  used.   Thus,  continuous  identification  of 
sounds  is  unnecessary  and,  therefore,  the  entire  spectrum  need 
not  be  transmitted.   The  significance  of  the  reduced  information 
can  be  seen  from  Shannon's  equation  for  channel  capacity, 

$$ C = W \log_2 \left( 1 + \frac{S}{N} \right) \tag{3} $$

where C is the channel capacity in bits per second, W is the
channel bandwidth in cycles per second, and S/N is the
signal-to-noise power ratio. For a fixed S/N, C is directly
proportional to W.


The channel capacity required to transmit the formant information
of speech (Flanagan, 1956) can be determined from the truncated
Fourier series representation of the formant signals

$$ f(t) = \sum_{n=0}^{N} \left( a_n \cos n\omega t + b_n \sin n\omega t \right) \tag{4} $$

where $\omega = 2\pi/T$, T is the duration of the sample in seconds, and
$a_n$ and $b_n$ are normally distributed random variables with zero
mean. If the channel possesses negligible phase distortion, the
bandwidth necessary to transmit f(t) with a prescribed accuracy
may be computed from the number of terms in the series. Tables
I and II show the results of Flanagan's experiment. Shannon's
equation also shows that for a given information content,
bandwidth reduction can be achieved with an increase in the
signal-to-noise ratio, which has strong limitations.

TABLE I.  Experimental results of the bandwidth required to
transmit the first three formants for various samples
and speakers (Flanagan, 1956).

Fig. 2.  Compressed speech channel signal-to-noise ratio as a function
of original channel signal-to-noise ratio (Campanella, 1958).

TABLE II.  Channel capacity necessary to transmit the formant
signals for an overall S/N of 40 db (Flanagan, 1956).

Formant    W (cps)    S/N (db)    C (bits/sec)

F1         7.1        33          78
F2         6.7        24          54
F3         5.3        20          35
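
As a rough check, the entries of Table II can be recomputed from Equation (3). The sketch below is illustrative only: the function name and the dictionary layout are assumptions, and the db values are taken to be power ratios. The computed capacities agree with the listed ones to within about one bit per second.

```python
import math

def channel_capacity(bandwidth_cps, snr_db):
    """Shannon capacity C = W * log2(1 + S/N), in bits per second."""
    snr_power_ratio = 10.0 ** (snr_db / 10.0)  # convert db to a power ratio
    return bandwidth_cps * math.log2(1.0 + snr_power_ratio)

# (W in cps, S/N in db, C in bits/sec) as listed in Table II
TABLE_II = {"F1": (7.1, 33, 78), "F2": (6.7, 24, 54), "F3": (5.3, 20, 35)}

computed = {name: channel_capacity(w, snr) for name, (w, snr, _) in TABLE_II.items()}

for name, (_, _, listed) in TABLE_II.items():
    print(f"{name}: computed {computed[name]:.1f} bits/sec, listed {listed}")
```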


All  bandwidth  compression  techniques  attempt  to  eliminate 
at  least  part  of  the  insignificant  information  in  speech.   Time 
and  frequency  compression  methods  are  possible  because  most 
vowel  sounds  have  a  duration  in  excess  of  that  required  for  the 
ear  to  identify  the  sound.   Also,  a  typical  vowel  sound  contains 
many  repetitions  of  a  basic  vowel  waveform.   Furthermore,  the 
information  identifying  the  speaker  and  the  speaker's  emotional 
status  is  not  essential,  and  may  be  eliminated. 

Continuous analysis-synthesis methods exploit the redundancy
in speech as well as the fact that the speech does not occupy all
of the spectrum space all of the time. Discrete sound
analysis-synthesis methods also eliminate redundancy and inefficient
use of the spectrum as well as identity and emotional status
information.

Sound group analysis-synthesis methods eliminate all
insignificant information and transmit only a code to "call out" an
entire word or phrase held in storage in the synthesizer. This
results in extremely low information rates. Table III shows
some comparative channel capacities necessary for speech
transmission by various means.

TABLE III.  Comparative channel capacities necessary for speech
transmission (Slaymaker, 1959).

Coding Method                             Necessary Channel Capacity (bits/sec)

Digitized speech waveform                 30,000
Phonetic pattern coded speech             60
Word coded speech (at 120 words/min)
    Vocabulary of 2 words                 2
    Vocabulary of 8,000 words             26
Vocoder                                   2,000
Teletype (120 words/min)                  75
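
The word-coded entries of Table III are consistent with sending one fixed-length index per word. The sketch below assumes equally likely words, so that each word costs log2 of the vocabulary size in bits; the function name is hypothetical and the model is an interpretation, not one stated by Slaymaker.

```python
import math

def word_coded_rate(vocabulary_size, words_per_minute=120):
    """Bits per second when each word is sent as a fixed-length index.

    Assumes equally likely words, so each word costs
    log2(vocabulary_size) bits.
    """
    words_per_second = words_per_minute / 60.0
    return words_per_second * math.log2(vocabulary_size)

rate_2_words = word_coded_rate(2)        # 2 words/sec * 1 bit/word
rate_8000_words = word_coded_rate(8000)  # 2 words/sec * about 13 bits/word

print(rate_2_words, round(rate_8000_words, 1))
```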


It  is  possible  to  relate  the  signal-to-noise  ratio  in  the 
compressed  speech  channel  to  the  bandwidth  reduction  factor,  the 
information  reduction  factor,  and  the  signal-to-noise  ratio  in 
the  original  speech  channel  by  the  use  of  Shannon's  equation  for 
channel  capacity.   For  example,  let  the  subscript  1  in  Equation 
(3)  refer  to  the  non-compressed  channel  and  the  subscript  2  refer 
to  the  compressed  channel. 
Then, since $C_1 = W_1 \log_2 (1 + S_1/N_1)$ and
$C_2 = W_2 \log_2 (1 + S_2/N_2)$, the signal-to-noise ratio in the
compressed channel is given by

$$ \frac{S_2}{N_2} = \left( 1 + \frac{S_1}{N_1} \right)^{(W_1/W_2)(C_2/C_1)} - 1 \tag{5} $$

and for $S_2/N_2 \gg 1$ and $S_1/N_1 \gg 1$ Equation (5) becomes

$$ \frac{S_2}{N_2} = \left( \frac{S_1}{N_1} \right)^{(W_1/W_2)(C_2/C_1)}. \tag{6} $$

In Fig. 2, the compressed speech channel signal-to-noise ratio is
plotted as a function of the signal-to-noise ratio of the
noncompressed speech channel for values of the exponent
$(W_1/W_2)(C_2/C_1)$ of 1 and 10. The unity exponent corresponds to
the case when the information rate is reduced by the same factor
as the channel bandwidth. Thus, for comparable performance, the
signal-to-noise ratio will be the same in a compressed channel as
in a noncompressed channel. Also, since the bandwidth required to
transmit the compressed channel signal is reduced by the factor
$(W_1/W_2)$, the white noise energy picked up in the channel is
reduced by the same factor, and the immunity of the compressed
speech channel to noise interference is improved by 10 log
$(W_1/W_2)$ db.

The exponent $(W_1/W_2)(C_2/C_1)$ may take on values greater than
one when the channel capacity is not reduced as much as the
bandwidth is compressed. In this case, the signal-to-noise
ratio in the compressed channel will always be higher than that
in the noncompressed channel for comparable performance. In
order to obtain comparable performance from a compressed channel
and a noncompressed channel, with $S_1/N_1$ = 20 db, $W_1/W_2$ = 10,
and $C_2/C_1$ = 1, a signal-to-noise ratio of 200 db would be
required in the former.
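
The 200 db figure can be reproduced directly from Equation (5). The following is a minimal sketch; the function and argument names are assumptions introduced for illustration, with `bandwidth_ratio` standing for W1/W2 and `capacity_ratio` for C2/C1.

```python
import math

def compressed_snr_db(original_snr_db, bandwidth_ratio, capacity_ratio):
    """Equation (5): S2/N2 = (1 + S1/N1) ** ((W1/W2) * (C2/C1)) - 1.

    bandwidth_ratio is W1/W2 and capacity_ratio is C2/C1.
    """
    s1_over_n1 = 10.0 ** (original_snr_db / 10.0)
    exponent = bandwidth_ratio * capacity_ratio
    s2_over_n2 = (1.0 + s1_over_n1) ** exponent - 1.0
    return 10.0 * math.log10(s2_over_n2)

# Exponent of 1: capacity reduced by the same factor as the bandwidth.
unity_exponent_db = compressed_snr_db(20.0, bandwidth_ratio=10.0, capacity_ratio=0.1)

# Exponent of 10: bandwidth compressed tenfold, capacity unchanged.
tenfold_exponent_db = compressed_snr_db(20.0, bandwidth_ratio=10.0, capacity_ratio=1.0)

print(round(unity_exponent_db, 2), round(tenfold_exponent_db, 2))
```

With the unity exponent the required signal-to-noise ratio stays at 20 db, while with an exponent of 10 it rises to roughly 200 db, in line with the example in the text.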

The  preceding  discussion  points  out  that  the  effectiveness 
of  a  bandwidth  compression  system  cannot  be  measured  by  the 
bandwidth  reduction  factor  alone;  the  influence  of  information 
reduction  must  also  be  taken  into  account  in  terms  of  the  signal- 
to-noise  ratio,  or  the  information  rate  that  must  exist  in  the 
compressed  speech  channel  to  obtain  speech  reproduction  with  a 
reasonable signal-to-noise ratio.

Two major advantages are gained by bandwidth compression.
The first is a more efficient use of the communication space.
Generally, telephone channels have a 3,000 cps bandwidth. This
wide bandwidth greatly limits the number of channels possible
in the allotted communication space. If the bandwidth can be
reduced, say by a factor of 10, then the number of possible
channels can also be increased by a factor of 10. The second
advantage is found in increased noise immunity. The noise in
communication channels is usually approximated by white noise,
that is, noise with a constant spectral density. Thus, the
total noise energy in a channel is directly proportional to the
channel bandwidth, and the noise immunity improvement is directly
proportional to the bandwidth reduction factor. This is
especially desirable in long, noisy communication links.


TIME  AND  FREQUENCY  COMPRESSION  METHODS 
Scan  Vocoder 

The  Scan  Vocoder  (voice-coder)  is  one  of  the  early  time 
compression  systems.   A  diagram  of  the  Scan  Vocoder  is  shown  in 
Fig.  3  (a) .   In  this  system,  the  transmission  of  the  speech 
signal  spectrum-envelope  requires  frequency  analysis.   This 
analysis  is  performed  by  a  set  of  magnetostriction  filters 
covering  the  frequency  range  from  130  to  133  kc. 

The output voltages of the analyzer filter set are rectified
and stored in capacitors. These voltages are then
scanned by a rotating switch. The amplitudes correspond to the
envelope of the voltage labeled as (3) in Fig. 3 (b). All
future references in this section will be understood to
refer to Fig. 3 (b). The sampled output is then smoothed to
obtain the envelope as indicated by waveform (4). The cut-off
frequency of the smoothing filter is approximately 200 cps, which
is the bandwidth needed for transmission of the envelope. If a
switch with low shunt capacitance is used, the high frequency
filter outputs may be connected directly to the switch contacts
and rectification may be accomplished with a single rectifier
located between the switch arm and the low-pass filter.

The multivibrator and the hiss generator in the synthesizer
are controlled by the pitch-frequency signal, so that the
modulator input is either a line spectrum (see waveform B) or a noise
spectrum (see waveform C). The upper sideband (130-133 kc) of
the suppressed carrier output spectrum (see waveform D) is
analyzed by another set of magnetostriction filters. The
filters are connected in pairs, so that there are only half as
many outputs as there are in the analyzer. The filter outputs
are connected to the high-frequency inputs of a set of modulators.
The modulators receive a control voltage from the contacts of a
rotating switch (see waveform (4) = (E)). Thus, the signal is
sampled and stored. If the envelope changes between samples,
the modulators receive the corresponding new voltage at the next
sample. The modulator outputs are connected together in three
groups. Group A contains the modulators 1, 4, 7, 10, ...; group
B, the modulators 2, 5, 8, 11, ...; and group C, the modulators
3, 6, 9, 12, .... A phase shift of 0°, 120°, and 240° is applied
to the groups A, B, and C respectively, and then these three groups
are added. This procedure restricts the modulation effect of any
modulator to the frequency range of its corresponding filter.
This is necessary since, otherwise, the envelope (see waveform F)
of the sideband will be highly distorted. The complex output
voltage of the three groups is demodulated by mixing with a
carrier of 130 kc (see waveform G). The audio-frequency band
thus detected is a close approximation of the original speech
signal (Vilbig and Haase, 1956).

The Scan Vocoder is very complex and does not achieve a very
large bandwidth reduction. For these reasons, this system has
received little attention in the published literature on speech
bandwidth compression.

Vobanc

The Vobanc (Voice Band Compression) is a speech bandwidth
compression system which utilizes frequency division and
multiplication (Bogert, 1956). The general principle is to divide
the speech band into three parts (0.2-1 kc, 1-2 kc, and 2-3.2 kc)
using filters after the speech signal has been pre-modulated
(see Fig. 4). Each of these bands contains one of the vowel
formants. The signal in each band is passed through a
regenerative modulator (see Fig. 6) which halves the frequency of the
strongest components of the formant, and translates the
neighboring frequency components downward by F/2, where
F is the frequency of the formant. The output of the regenerative
modulator is filtered in order to obtain a bandwidth of one-half
that of the original. At the receiving end, the frequency of
each of the component bands is translated to double its value,
and these are recombined in an attempt to generate the original
spectrum.

A block diagram of the Vobanc is shown in Fig. 4. The
input speech signal is modulated by a 108-kc oscillator. The
difference frequency components are selected by "A" filters in
three separate channels. The transmission characteristics of
the "A" filters are shown in Fig. 5 (a). The A1 filter transmits
a band from 107.8 to 107 kc, which corresponds to the difference
frequencies resulting from the modulation of the 108-kc carrier
by a frequency in the range of 0.2 to 1 kc, corresponding to the
first formant range. Second and third formant ranges are
transmitted by filters A2 (107 to 106 kc) and A3 (106 to 104.8 kc)
respectively.

The output of each of the "A" filters is fed to a
regenerative modulator (Fig. 6). The input of frequency f is modulated
by a balanced modulator, whose output forms the input to a
filter which selects only difference frequencies. This output
is then amplified to form the carrier input signal to the
balanced modulator. No feedback develops unless an input signal
is applied. The circuit has a dynamic range of 35 db.

If two closely spaced frequencies are applied to the
regenerative modulator, the average frequency of the input signal is
halved, while the difference frequency between the two components
remains the same. For speech signals, the formant frequencies
are halved and the surrounding frequencies are reduced by half
the formant frequency. The spacing between harmonic components
of the speech signal remains the same in the process, but the
range of formant variation is halved. Thus, at the output of the
regenerative modulator the speech formant range can be included
within a bandwidth one-half that of the corresponding "A" filter.
The filters which select the half frequency components are
labeled the "B" filters (Figs. 4 and 5 (b)). The frequencies
passed by the "B" filters are then modulated down to the frequency
range 75 to 1925 cps by the output mixers of the transmitting
terminal.
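
The frequency bookkeeping of one regenerative modulator can be illustrated with a small sketch. It is an idealized model under stated assumptions, not a circuit simulation: the loop is assumed to settle on a carrier at half the dominant (formant) frequency, every component f is assumed to leave the difference filter at f minus that carrier, and the function name and the baseband example frequencies are hypothetical.

```python
def regenerative_modulator(component_frequencies, formant_frequency):
    """Idealized frequency bookkeeping for one regenerative modulator.

    Assumes the feedback carrier sits at half the formant frequency
    and the filter passes only the difference f - carrier.
    """
    carrier = formant_frequency / 2.0
    return [f - carrier for f in component_frequencies]

# Hypothetical harmonics spaced 100 cps apart around a 500 cps formant.
inputs = [400.0, 500.0, 600.0]
outputs = regenerative_modulator(inputs, formant_frequency=500.0)

input_spacing = inputs[1] - inputs[0]
output_spacing = outputs[1] - outputs[0]
print(outputs, input_spacing, output_spacing)
```

In this example the formant component moves from 500 to 250 cps (halved), its neighbors are shifted down by 250 cps, and the 100 cps harmonic spacing is unchanged.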

At the receiving terminal the compressed speech is again
modulated up to the frequency range of the "B" filters. The
output of each "B" filter is doubled in frequency, and the outputs
are summed and filtered to restore the 108-kc carrier range. The
resulting signal is then mixed with a 108-kc local oscillator to
restore the signal to the audio-frequency range.

The Vobanc achieves a bandwidth reduction of slightly less
than 2:1 due to the guard-band width allowed for the three
channels. The quality of the speech is reasonable, as only
slight distortion is introduced. The articulation efficiency
ranges from 79 to 91 percent, where articulation efficiency is
defined as the number of words understood divided by the total
number of words transmitted.

Codimex  System 

The Codimex (compression-division-multiplication-expansion)
system falls into the category of "formant tracking" devices,
which also includes the Vobanc. The Codimex system uses many of
the principles used in the Vobanc (Daguet, 1963). The
instantaneous frequency of a single sideband, suppressed carrier signal
is put through a dividing process to obtain a 4 to 1 bandwidth
reduction.

The spectral analysis of the compressed formant reveals the
following effects:

1.  Reduction of the frequency scale excursion by a factor
of 8.

2.  Concentration of the spectrum about an average
frequency.

3.  Increase of the average amplitude level, which shows
only slight variation.

The Codimex system transmits signals representing the compressed
formants. These signals are transmitted at similar and slightly
varying levels. The energy of the signals is concentrated in a
very narrow frequency band.

The voice signal is separated into three parts corresponding
to the three formant frequency ranges, namely 300-700 cps,
700-2,000 cps, and 2,000-3,400 cps. Each formant is reduced in
bandwidth by a separate operation. The starting point is the
separation of the signal amplitude, $a(t)$, and phase term,
$\cos \phi(t)$, as functions of time, in such a way that the real
signal, $S(t)$, may be represented as

$$ S(t) = a(t) \cos \phi(t). \tag{7} $$

Let $S(t)$ be a signal occupying a limited frequency band with
finite energy. The Hilbert transform pairs,

$$ \sigma(t) = \frac{1}{\pi} \int_{-\infty}^{\infty} \frac{S(\tau)}{\tau - t}\, d\tau \tag{8} $$

and

$$ S(t) = -\frac{1}{\pi} \int_{-\infty}^{\infty} \frac{\sigma(\tau)}{\tau - t}\, d\tau, \tag{9} $$

are then used to form orthogonal component functions in such a
way that

$$ \psi(t) = S(t) + j\sigma(t) = a(t) \cos \phi(t) + j\, a(t) \sin \phi(t) \tag{10} $$

where $\psi(t)$ has been named the "analytic signal" by Ville (1948);
it was first introduced by Gabor (1945).

Another form of $\psi(t)$ used is

$$ \psi(t) = a(t)\, e^{j\phi(t)} \tag{11} $$

where $a(t)$ is always a positive function. The actual signal is
given by

$$ S(t) = \mathrm{Re}\, \psi(t) = a(t) \cos \phi(t) \tag{12} $$

as before. Thus, the functions $a(t)$ and $\phi(t)$ are unique and
entirely determined from $S(t)$.

The process of frequency compression is accomplished by
subjecting the analytic function to a square root extracting
process so that

$$ \sqrt{\psi(t)} = \sqrt{a(t)}\, e^{j\phi(t)/2}. \tag{13} $$

For signals corresponding to the speech formants, the
spectral analysis shows that the bandwidth reduction is
proportional to the frequency compression. In the Codimex system, the
square root extracting process is repeated three times, giving the
result

$$ [\psi(t)]^{1/8} = [a(t)]^{1/8}\, e^{j\phi(t)/8}. \tag{14} $$

The received signal undergoes the reverse process at the receiving
end.
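
The effect of the three square root stages on a single sample of the analytic signal can be checked numerically. This is a minimal sketch under stated assumptions: the function names and the sample values (amplitude 16, phase 0.8 rad) are hypothetical, and `cmath.sqrt` returns the principal root, so the phase is halved only while it lies between -pi and pi. The hardware resolves the corresponding sign ambiguity with a scale-of-two circuit.

```python
import cmath

def compress(sample, stages=3):
    """Apply the square root operation `stages` times to one sample."""
    for _ in range(stages):
        sample = cmath.sqrt(sample)  # principal square root
    return sample

def expand(sample, stages=3):
    """Reverse process at the receiver: square `stages` times."""
    for _ in range(stages):
        sample = sample * sample
    return sample

# Hypothetical analytic-signal sample with a(t) = 16 and phase 0.8 rad.
original = cmath.rect(16.0, 0.8)
compressed = compress(original)
amplitude, phase = cmath.polar(compressed)
restored = expand(compressed)

print(amplitude, phase)
```

The amplitude becomes the eighth root of 16 and the phase is divided by 8, as in Equation (14); squaring three times recovers the original sample.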

The amplitude and phase of the analytic signal may be
obtained by subjecting the real signal $S(t)$ to a single sideband
suppressed carrier (SSB) modulation process. If $\omega$ is the
carrier frequency, the SSB signal will be $a(t) \cos [\omega t + \phi(t)]$.
The single sideband signal may be obtained by two modulation
processes using carriers in quadrature:

$$ S(t) \cos \omega t - \sigma(t) \sin \omega t = a(t) \cos \phi(t) \cos \omega t - a(t) \sin \phi(t) \sin \omega t = a(t) \cos [\omega t + \phi(t)]. \tag{15} $$

The square root of the signal $a(t) \cos [\omega t + \phi(t)]$ is
obtained using the following steps:

1.  Detection of the envelope $a(t)$ (this is possible due to
the separation of the spectra of $a(t)$ and $\cos [\omega t + \phi(t)]$).

2.  Addition of the envelope to the SSB signal, giving
$a(t) \{ 1 + \cos [\omega t + \phi(t)] \}$.

3.  Feeding this signal into a network whose output is
proportional to the square root of the input, which results in

$$ \left\{ a(t) \left( 1 + \cos [\omega t + \phi(t)] \right) \right\}^{1/2} = \sqrt{2 a(t)} \left| \cos \frac{\omega t + \phi(t)}{2} \right|. $$

4.  Division by two, effected by switching a scale of two
each time the signal $\left| \cos \frac{\omega t + \phi(t)}{2} \right|$ passes through
zero.

5.  Reversing the sign of $\left| \cos \frac{\omega t + \phi(t)}{2} \right|$ each time the
scale of two is switched, thus producing

$$ \sqrt{2 a(t)} \cos \frac{\omega t + \phi(t)}{2}. $$
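
The steps above can be sketched numerically. This is a simplified model under stated assumptions, not the Codimex circuit: the envelope is held constant, the total phase is assumed to start between -pi and pi and to be sampled finely enough that no zero is skipped, and the "scale of two" is modeled as a sign flag toggled whenever the phase passes an odd multiple of pi. All names are hypothetical.

```python
import math

def half_frequency(envelope, total_phase_samples):
    """Square-root an SSB signal and restore its sign with a flip-flop.

    total_phase_samples holds x = w*t + phi(t) at successive instants.
    """
    out = []
    sign = 1.0
    prev_zone = math.floor((total_phase_samples[0] + math.pi) / (2.0 * math.pi))
    for x in total_phase_samples:
        zone = math.floor((x + math.pi) / (2.0 * math.pi))
        if zone != prev_zone:   # the rectified signal has passed through zero
            sign = -sign        # toggle the scale of two
            prev_zone = zone
        # Steps 2 and 3: add the envelope, then take the square root.
        rectified = math.sqrt(envelope * (1.0 + math.cos(x)))
        out.append(sign * rectified)
    return out

a = 2.0
phases = [0.01 * i for i in range(2001)]   # x from 0 to 20 rad
result = half_frequency(a, phases)
expected = [math.sqrt(2.0 * a) * math.cos(x / 2.0) for x in phases]
max_error = max(abs(r - e) for r, e in zip(result, expected))
print(max_error)
```

The sign-corrected output matches sqrt(2a) cos(x/2) to within rounding error, which is the half-frequency signal produced in step 5.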


The square root operation is repeated three times and the
three channels are transmitted by frequency multiplexing. At the
receiver, the signal is demultiplexed and separated into the
three formant bands. The compressed waveform is then squared by
means of a full-wave rectifier operating along a parabolic
characteristic:

$$ \left[ \sqrt{2 a(t)} \cos \frac{\omega t + \phi(t)}{2} \right]^2 = 2 a(t) \cos^2 \frac{\omega t + \phi(t)}{2} = a(t) \left\{ 1 + \cos [\omega t + \phi(t)] \right\}. \tag{16} $$

The low frequency signal $a(t)$ is eliminated by a high pass filter
leaving $a(t) \cos [\omega t + \phi(t)]$, which is demodulated to restore
the original signal $S(t) = a(t) \cos \phi(t)$. A block diagram of
the Codimex system is shown in Fig. 7.

The Codimex system provides a rather modest bandwidth
reduction, but reproduces good quality speech. Further bandwidth
reduction is possible, but the system is said to become quite
complex.

Correlation  Vocoders 

Correlation Vocoders (Schroeder, 1962) utilize speech
analysis in the time domain by correlation techniques. The
Wiener-Khinchin relationship suggests that correlation analysis
can take the place of spectral analysis. The autocorrelation
function and the power spectrum of a given signal form a Fourier
transform pair,

$$ \phi(\tau) = \frac{1}{2\pi} \int_{-\infty}^{\infty} G(\omega) \cos \omega\tau \, d\omega \tag{17} $$

and

$$ G(\omega) = \int_{-\infty}^{\infty} \phi(\tau) \cos \omega\tau \, d\tau. \tag{18} $$

Fano (1950) has extended this relationship to include
short-time analysis as

$$ \phi_t(\tau) = \frac{e^{a|\tau|}}{2\pi} \int_{-\infty}^{\infty} G_t(\omega) \cos \omega\tau \, d\omega \tag{19} $$

and

$$ G_t(\omega) = \int_{-\infty}^{\infty} e^{-a|\tau|}\, \phi_t(\tau) \cos \omega\tau \, d\tau, \tag{20} $$

where $\phi_t(\tau)$ is the short-time autocorrelation function, $G_t(\omega)$
is the short-time power spectrum, and 1/a is a time constant.

The autocorrelation function is taken over an interval of
approximately 30 milliseconds for speech analysis, and is given
by

$$ \phi(\tau) = \overline{S(t)\, S(t - \tau)} \tag{21} $$

where the bar denotes the time average. $\phi(\tau)$ is symmetric in $\tau$,
and is bandlimited to the same frequency range as the signal.
The spectrum, $\Phi(f)$, of the autocorrelation function is the
absolute square of the signal spectrum, $S(f)$. Therefore,

$$ \Phi(f) = |S(f)|^2. \tag{22} $$

Thus, $\phi(\tau)$ contains the same information as the amplitude
spectrum of the signal, $|S(f)|$.
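
Equation (22) has an exact discrete counterpart that is easy to verify. The sketch below is an illustration under assumptions of its own: it uses a short made-up sequence, a circular (periodic) autocorrelation in place of the time average, and a direct discrete Fourier transform written out by hand.

```python
import cmath

def dft(x):
    """Direct discrete Fourier transform of a short sequence."""
    n = len(x)
    return [sum(x[k] * cmath.exp(-2j * cmath.pi * m * k / n) for k in range(n))
            for m in range(n)]

def circular_autocorrelation(x):
    """r[lag] = sum over k of x[k] * x[(k - lag) mod n]."""
    n = len(x)
    return [sum(x[k] * x[(k - lag) % n] for k in range(n)) for lag in range(n)]

signal = [0.0, 1.0, 0.5, -0.3, -1.0, -0.2, 0.4, 0.9]  # made-up samples

autocorrelation = circular_autocorrelation(signal)
power_spectrum = [abs(v) ** 2 for v in dft(signal)]
autocorrelation_spectrum = dft(autocorrelation)

max_difference = max(abs(a - p) for a, p in
                     zip(autocorrelation_spectrum, power_spectrum))
print(max_difference)
```

The transform of the autocorrelation agrees with the squared magnitude of the signal spectrum to within rounding error, and the autocorrelation itself is symmetric in the lag, so no phase information survives.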

An autocorrelation vocoder is shown in Fig. 8. A short-time
autocorrelation function of the speech signal is derived for
a number of discrete delays, $\tau_0, \tau_1, \ldots, \tau_n$, in the analyzer. The
autocorrelation function is completely specified by its values at
discrete delays with a spacing of $\Delta\tau = 1/(2 f_c)$, where $f_c$ is the
cut-off frequency of the speech. For $f_c \approx 3.3$ kc, a $\Delta\tau$ of
0.167 msec suffices. The maximum delay for which the short-time
autocorrelation function needs to be specified is of the order of
3 msec. In Schroeder's vocoder, there are 18 "delay channels",
each with a bandwidth of 20 cps, for a total bandwidth of 360 cps.
This approaches a bandwidth compression ratio of 10:1, but when the
pitch channel and guard-bands are taken into account the ratio
is reduced to the order of 9:1. The autocorrelation vocoder
conserves bandwidth in a manner similar to the spectrum
analyzing devices. The phase information is discarded, the
autocorrelation is averaged over several fundamental periods,
and some spectral resolution is sacrificed.
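
The channel arithmetic in this paragraph can be worked through as follows. The names are hypothetical and the calculation is only a sketch; note that the quoted 0.167 msec spacing is what 1/(2 f_c) gives for a cut-off of 3 kc, whereas a 3.3 kc cut-off gives about 0.152 msec.

```python
def delay_spacing_sec(cutoff_cps):
    """Sampling-theorem spacing of the delay taps: 1 / (2 * f_c)."""
    return 1.0 / (2.0 * cutoff_cps)

spacing_3000 = delay_spacing_sec(3000.0)   # about 0.167 msec
spacing_3300 = delay_spacing_sec(3300.0)   # about 0.152 msec

delay_channels = 18
channel_bandwidth_cps = 20
total_bandwidth_cps = delay_channels * channel_bandwidth_cps

# Longest delay covered by 18 taps at the 0.167 msec spacing.
max_delay_sec = delay_channels * spacing_3000

# Ratio of a 3.3 kc speech band to the summed delay-channel bandwidth,
# before the pitch channel and guard-bands are counted.
compression_ratio = 3300.0 / total_bandwidth_cps

print(total_bandwidth_cps, max_delay_sec, round(compression_ratio, 2))
```

Eighteen taps at 0.167 msec span 3 msec, matching the stated maximum delay, and the 360 cps total gives a ratio a little above 9:1 for a 3.3 kc band.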

At the synthesizer a symmetrical replica of the
autocorrelation function is generated for every pitch period by
reciprocating scanning (Fig. 9). Neglecting the truncating distortion
apparent at the ends of some scans, the synthesized signal has
an amplitude spectrum that is the square of the original speech
spectrum, $\Phi(f) = |S(f)|^2$.

The spectrum squaring, inherent in autocorrelation vocoders,
needs to be compensated if natural sounding speech is to be
obtained. While spectrum-squared speech is fairly intelligible,
it has an unpleasant muffled and uneven quality. A time-varying
equalizer (Fig. 10) compensates for the squared spectrum. It
consists of three filters for formant extraction, rectifiers,
low-pass filters, square root extractors, and dividers. In the
equalized signal at the output, the formant amplitudes are
reduced to the square root of their original amplitudes, thus
compensating for the spectrum squaring of the autocorrelation
vocoder.

Schroeder rated the autocorrelation vocoder as high in
intelligibility and fair in quality. A certain distortion,
attributed to the chopping of the individual pitch periods
(see Fig. 9), was noticeable. To reduce this distortion, the
autocorrelation coefficients have been tapered by a "Hamming"
window function,

$$ H(\tau) = 0.54 + 0.46 \cos \frac{\pi \tau}{\tau_{max}}. \tag{23} $$
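
A short sketch of the tapering follows. It assumes the window has the form H(tau) = 0.54 + 0.46 cos(pi tau / tau_max), weights each coefficient by the window value at its delay, and uses a hypothetical maximum delay of 3 msec; the function names are illustrative.

```python
import math

def hamming_lag_window(delay, max_delay):
    """H(tau) = 0.54 + 0.46 * cos(pi * tau / tau_max)."""
    return 0.54 + 0.46 * math.cos(math.pi * delay / max_delay)

def taper(coefficients, spacing, max_delay):
    """Weight the coefficient at delay i * spacing by the window value."""
    return [c * hamming_lag_window(i * spacing, max_delay)
            for i, c in enumerate(coefficients)]

max_delay_msec = 3.0   # hypothetical maximum delay
weights = [hamming_lag_window(tau, max_delay_msec) for tau in (0.0, 1.5, 3.0)]
tapered = taper([1.0, 1.0, 1.0], spacing=1.5, max_delay=max_delay_msec)

print([round(w, 2) for w in weights])
```

The weight falls smoothly from 1 at zero delay to 0.08 at the maximum delay, so the scan ends are attenuated rather than chopped abruptly.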

In order to minimize the number of autocorrelation signals
to be transmitted, the maximum delay, $\tau_{max}$, for speech frequencies
above 2.5 kc has been reduced to 1.5 msec. $\tau_{max}$ has been
maintained at 2.5 msec for the medium speech frequencies (1.5 to 2.5
kc). In order to improve the spectral resolution at low
frequencies, $\tau_{max}$ has been extended to 5 msec for frequencies below
1.25 kc. The corresponding increase in the number of channels
is small because the sampling interval for low frequencies is
relatively large; for instance, at 0.625 kc, the sampling interval
is 0.4 msec. The total number of channels in the improved version
of the autocorrelation vocoder is 27 for an input bandwidth of
5 kc. A bandwidth reduction of the order of 9:1 is maintained,
however. Schroeder rated this version of the autocorrelation
vocoder as superior or equal to the best known spectrum channel
vocoder with comparable bandwidth compression.

The problem of spectrum squaring in a correlation vocoder
can be avoided by cross-correlating the speech with a
speech-derived signal having a flat spectral envelope. A
cross-correlation analyzer is shown in Fig. 11. The spectrum flattener
contains a non-linear network producing a flat distortion spectrum
at the output for a variety of speech inputs. Several such
spectrum flatteners have been invented at Bell Telephone
Laboratories and are described in the literature.

A spectrum flattener, described by Schroeder and David
(1960), consists of a piece-wise linear network with an
input-output characteristic of straight-line segments (see Fig. 12).
The output is + 1 volt for inputs of 0 and + 4 volts, and - 1
volt for an input of + 2 volts. Thus, for an input voltage of
greater than 3 volts, the output will contain 4 zero crossings
for each one in the input. The spectrum flattener is quite
independent of the form of the input (clipping or no clipping).
This multiplication of zeros is accompanied by the desired
spectral flattening.
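
The multiplication of zero crossings can be illustrated with a small model of the characteristic. Several points here are assumptions rather than facts from Schroeder and David: the curve is taken to be even-symmetric about zero input (which is what yields four output crossings per input crossing), inputs beyond 4 volts are simply clamped, and the function names are hypothetical.

```python
def flattener(v):
    """Piecewise-linear curve through (0, +1), (2, -1), (4, +1) volts.

    Even symmetry about zero and clamping beyond 4 volts are assumed.
    """
    x = min(abs(v), 4.0)
    return 1.0 - x if x <= 2.0 else x - 3.0

def count_zero_crossings(values):
    """Count sign changes, ignoring samples that are exactly zero."""
    crossings = 0
    last_sign = 0
    for value in values:
        sign = (value > 0) - (value < 0)
        if sign == 0:
            continue
        if last_sign != 0 and sign != last_sign:
            crossings += 1
        last_sign = sign
    return crossings

# One input zero crossing: a sweep from -3.5 to +3.5 volts.
sweep = [-3.5 + 0.01 * i for i in range(701)]
input_crossings = count_zero_crossings(sweep)
output_crossings = count_zero_crossings([flattener(v) for v in sweep])

print(input_crossings, output_crossings)
```

Under these assumptions a single zero crossing of an input that swings beyond 3 volts produces four zero crossings at the output, as described in the text.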

The synthesizer of a cross-correlation vocoder is identical
to that of an autocorrelation vocoder, provided that the
cross-correlation function is reasonably symmetric so that reciprocal
scanning can be used. The function can be made nearly symmetric
by the proper choice of the reference signal with which the
speech is cross-correlated. One such reference signal consists
of pulses occurring at the relative maxima of the speech signal
and having amplitudes proportional to the square of the speech
maxima.

Schroeder termed the cross-correlation vocoders as inferior
to the autocorrelation type for two reasons. First, the
cross-correlation function is not truly symmetric; thus, a synthesizer
requiring symmetry produces distortion in the center of each
pitch period. Secondly, the reference signal does not have a
truly flat spectrum. Thus, the spectrum of the synthesized signal
differs from the original spectrum. The major advantage of
cross-correlation analysis is that no analog multipliers are
required if the reference signal is in binary form (Schroeder, 1962).

Sampling  Techniques  for  Bandwidth  Compression 

Recently,  various  authors  have  suggested  sampling  speech 
signals  both  in  the  time-frequency  domain  and  in  the  frequency 
domain  alone.   Peterson  and  Subrahmanyam  (1959)  attempted  to 
compress  the  effective  speech  bandwidth  by  simultaneous  sampling 
in  the  time  and  frequency  domains,  but  the  results  were  not  very 
encouraging. On the other hand, Kryter (1960) has shown that
satisfactory communication could be obtained by sampling the
speech with three 500 cps bandpass filters, but the total
bandwidth in this case comes out to about 2340 cps at 30 db. Also,
because of the non-uniform sampling, the different bands have to be
translated to form a compact spectrum so that a frequency-division
multiplex system may be used. A method of uniform sampling in
the frequency domain, along with a necessary correction to make
direct multiplexing possible, has been shown by Das (1961).

A finite sample of a time-varying signal may be represented
as a sum or integral of exponentials,

\[ F(w,t) = \sum_{n} A_n e^{P_n t} \tag{24} \]

where $A_n$ and $P_n$ are complex, if the waveform is known for all
time. For any random function, such as speech, a possible
representation in the time-frequency plane is of the form

\[ F(w,t) = \sum_{m}\sum_{n} A_{mn} U_{mn}(t) \tag{25} \]

where the $A_{mn}$, given by $S(m\Theta, n/\Theta)$, are the
coefficients of the sampling functions $U_{mn}(t)$ corresponding to
the cross-points of any grid laid on the t-w plane, for which the
separation of the cross-lines is $\Theta$ in time and $1/\Theta$ in
frequency. Then the signal may be represented by (m x n) numbers for
a given $\Theta$, and the sampling may be accomplished in either the
frequency or the time domain. Ideally, a signal of bandwidth W and
duration T requires 2TW numbers to specify it completely, but
slowly-varying signals, such as speech, have much less essential
information content than that specified by 2TW numbers.

Dudley (1940) has shown that the voiced sound may be
represented as

\[ F_v(w,t) = S(t)\sum_{k=1}^{n} r(w,t)\,A_k \cos\!\left[ kP\int_0^t P(t)\,dt + \theta_k \right] \tag{26} \]

where the carrier is composed of n audible harmonics of relatively
high frequencies having amplitude $A_k$, frequency kP, and phase
$\theta_k$. S(t) is the switching function, P(t) is the inflecting
factor, and r(w,t) is the effect of selective transmission. The
three message functions S(t), P(t), and r(w,t) produce the
necessary modulation processes on the carrier at the low rate at
which the syllables are formed. The total information is mainly
dependent on these slowly-varying parameters. The unvoiced case
is only a degenerate case of Equation 26.
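
Equation 26 can be evaluated directly in discrete time. The sketch
below is illustrative only and makes several assumptions that are not
in the original: the selective-transmission term r(w,t) is reduced to
one fixed gain per harmonic, the pitch P is given in cps and converted
to radians, and the integral of P(t) is accumulated by a simple
running sum. All function and parameter names are hypothetical.

```python
import math

def voiced_sound(n_samples, fs, pitch_cps, switch, inflect, gains, amps, phases):
    """Discrete-time evaluation of Equation 26.

    switch(t)  : switching function S(t)
    inflect(t) : inflecting factor P(t)
    gains      : one fixed gain per harmonic, standing in for r(w,t) (assumption)
    amps       : harmonic amplitudes A_k
    phases     : harmonic phases theta_k in radians
    """
    dt = 1.0 / fs
    integral = 0.0                  # running integral of P(t) from 0 to t
    out = []
    for i in range(n_samples):
        t = i * dt
        total = 0.0
        for k, (a_k, r_k, th_k) in enumerate(zip(amps, gains, phases), start=1):
            angle = k * 2.0 * math.pi * pitch_cps * integral + th_k
            total += r_k * a_k * math.cos(angle)
        out.append(switch(t) * total)
        integral += inflect(t) * dt
    return out

# Three harmonics of a 100 cps pitch, switched on for the first half only.
wave = voiced_sound(
    800, 8000, 100.0,
    switch=lambda t: 1.0 if t < 0.05 else 0.0,
    inflect=lambda t: 1.0,
    gains=[1.0, 0.5, 0.25], amps=[1.0, 1.0, 1.0], phases=[0.0, 0.0, 0.0],
)
```

Only the three slowly varying functions (here `switch`, `inflect`, and
`gains`) carry the message; the harmonic carrier itself is generated
locally, which is the basis of the bandwidth saving.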

The effect of sampling in the frequency domain is shown in
Fig. 13, where the instantaneous amplitudes of the component
frequencies may be represented by $A_1, A_2, \ldots, A_n$. The sampled
signal, consisting of $A_2$, $A_5$, $A_8$, etc., has holes in the
amplitude-frequency curve, and consequently the intelligibility and
naturalness deteriorate. A process similar to the pulse
lengthening generally used in time-division multiplex systems
is used to improve the quality of the signals. The holes
created by sampling are filled by inserting sidebands
corresponding to the accepted bands, but differing in frequency by
$\Delta f/3$, where $\Delta f = (f_5 - f_2) = (f_8 - f_5) = (f_n - f_{n-3})$.
The reconstructed amplitude curve would then be of the staircase type
and the new frequency-domain representation of the signal would be


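\[ F(w,t) = F_1(w,t) + F_2(w,t) \tag{27} \]

where

\[ F_1(w,t) = A_{m2}U_{m2}(t) + A_{m5}U_{m5}(t) + A_{m8}U_{m8}(t) + \cdots = \text{the sampled signal} \tag{28} \]

and

\[ F_2(w,t) = A_{m2}\left[U_{m1}(t) + U_{m3}(t)\right] + A_{m5}\left[U_{m4}(t) + U_{m6}(t)\right] + A_{m8}\left[U_{m7}(t) + U_{m9}(t)\right] + \cdots = \text{sidebands of the sampled signal} \tag{29} \]
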
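
The gap-filling step of Equations 27 to 29 can be sketched on a list
of component amplitudes. This is an illustration under simplifying
assumptions (one amplitude per component, every third component kept,
sidebands copied with unchanged amplitude); the function name and
arguments are hypothetical.

```python
def sample_and_fill(amplitudes, first=1, step=3):
    """Return (F1, F2): the sampled components and their inserted sidebands.

    `first` is the zero-based index of the first kept component, so the
    default keeps A_2, A_5, A_8, ... as in Equation 28.
    """
    sampled = [0.0] * len(amplitudes)
    sidebands = [0.0] * len(amplitudes)
    for i in range(first, len(amplitudes), step):
        sampled[i] = amplitudes[i]
        # Each kept amplitude is repeated one position below and above it.
        for j in (i - 1, i + 1):
            if 0 <= j < len(amplitudes):
                sidebands[j] = amplitudes[i]
    return sampled, sidebands

a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]   # A_1 ... A_9
f1, f2 = sample_and_fill(a)
staircase = [p + q for p, q in zip(f1, f2)]          # F = F1 + F2
```

For this input the sum is the staircase 2, 2, 2, 5, 5, 5, 8, 8, 8:
the holes left by sampling are filled with the neighbouring accepted
amplitude.
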
Some  measure  of  the  error  in  the  synthesized  signal  may  be 
obtained  from  the  relations 

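\[ \text{Variation} = \int \left| D'(f) \right| \, df \tag{30} \]

\[ \text{Difference} = \int \left| D(f) \right| \, df \tag{31} \]

\[ \text{Square of difference} = \int \left| D(f) \right|^2 \, df \tag{32} \]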

where  D(f)  is  the  difference  between  the  original  spectra  and 
the  synthesized  spectra,  the  difference  over  all  f  being  normal- 
ized to  zero. 
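
These three measures are easy to evaluate on sampled spectra. The
sketch below is one possible numerical reading, with assumptions
stated plainly: the spectra are given on a common frequency grid, the
integrals use the trapezoid rule, and "normalized to zero" is taken to
mean that the mean of D(f) over the band is subtracted. The names are
illustrative.

```python
def trapezoid(y, x):
    """Trapezoid-rule integral of samples y over grid x."""
    return sum(0.5 * (y[i] + y[i + 1]) * (x[i + 1] - x[i]) for i in range(len(x) - 1))

def spectral_errors(f, original, synthesized):
    """Return (variation, difference, square of difference), Equations 30 to 32."""
    d = [s - o for s, o in zip(synthesized, original)]
    mean = trapezoid(d, f) / (f[-1] - f[0])
    d = [v - mean for v in d]            # difference over all f normalized to zero
    # The integral of |D'(f)| reduces to the sum of |D| increments on the grid.
    variation = sum(abs(d[i + 1] - d[i]) for i in range(len(d) - 1))
    difference = trapezoid([abs(v) for v in d], f)
    square = trapezoid([v * v for v in d], f)
    return variation, difference, square

# Example: the synthesized spectrum exceeds the original by a linear ramp.
grid = [i / 1000.0 for i in range(1001)]
variation, difference, square = spectral_errors(grid, [0.0] * 1001, grid)
```

For the ramp example the normalized difference is D(f) = f - 1/2, so
the three measures are 1, 1/4, and 1/12 respectively.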

In a practical sampling system, each coefficient $A_{mn}$ will
represent a small band of frequencies, and it is necessary to
determine the minimum number of bands that are to be transmitted
as well as the width of each band. An experimental system is
shown  in  Fig.  14.   In  the  transmitter,  the  different  bands  are 
selected  by  bandpass  filters,  and,  in  the  receiver,  the  trans- 
mitted signal  as  well  as  the  sidebands  generated  in  the  balanced 
modulator  are  added  together  before  final  amplification.   In 
order  to  avoid  transient  disturbances  in  the  output,  the  sampling 
filters  should  have  a  smooth  cut-off  characteristic,  even  at  the 
expense  of  the  bandwidth  of  the  system.   Leakage  of  the  carrier 
in  the  receiver  causes  masking  of  the  signal  and  a  sharp  band- 
rejection  filter  is  used  to  eliminate  it  from  the  output.   An- 
other alternative  is  to  translate  the  speech  signal  to  a  higher 
band  of  frequencies,  say  6-10  kc,  then  sample  it,  mix  it  with  its 
sidebands,  and  then  retranslate  it  back  to  its  original  frequency 
band. 

Das (1961) found that with six filters, each with a bandwidth
of 200 cps, and with Δf (gap width) equal to 600 cps, the system
had an articulation efficiency of about 85% without the addition of
the sidebands. With the addition of the sidebands, the articulation
efficiency approached 100%. The articulation efficiency was found to
improve with increased bandwidth. The optimum shift of the samples
was found to be about 150 to 250 cps, and any attempt to fill up
completely the gaps having Δf > 750 cps tended to deteriorate the
receiver output. Tests were initially performed with a time-varying
carrier fed to the modulator to cover the wider gaps in the spectrum,
but the results with the fixed carrier were found to be better. The
sampling bandwidth may be decreased to about 100 cps, but the
filter transients then become prominent in the output. The carrier
suppression in the modulator has to be more than 50 db. Because
of the smaller effective bandwidth of the system, the
signal-to-noise ratio in the output is also improved.

Das  concluded  that  since  the  sampling  is  uniform,  different 
channels  may  be  multiplexed  without  any  frequency  translation  of 
the  different  filter  outputs,  and  that  the  bandwidth  compression 
possible  by  this  method  is  superior  to  that  of  other  similar 
methods.

CONTINUOUS ANALYSIS-SYNTHESIS METHODS

Channel  Vocoders 

In  1939,  Dudley  introduced  the  first  channel  vocoder  shown 
in  Fig.  15.   It  recognized  that  speech  may  be  voiced  or  unvoiced, 
and that intelligibility is retained by preserving the short-time
amplitude spectrum. A set of band-pass filters (BP_1 to BP_n),
with rectifiers and low-pass filters, produces the discrete
short-time speech spectrum. A separate device, called the pitch
extractor (Fig. 16), develops a voltage proportional to the
fundamental  frequency  of  the  voiced  sounds.   The  pitch  control 
voltage  is  also  used  to  control  voiced-unvoiced  selection.   The 
pitch  voltage  takes  on  values  above  a  certain  threshold  for 
voiced  sounds,  but  remains  at  a  steady  state  value  below  the 
threshold  for  silence  and  unvoiced  sounds.   The  pitch  signal 
modulates  the  frequency  of  a  cord-tone  generator  (buzz)  at  the 
receiver  and  selects  either  the  cord-tone  or  noise  for  excitation 
in  the  synthesis.   The  spectrum  signals  are  applied  to  modulators 
(M  in  Fig.  15)  at  respective  inputs  to  an  identical  set  of  band- 
pass filters.   The  filter  outputs  are  summed  and  the  short-time 
spectrum  is  reconstructed.   Each  spectrum  channel  requires  about 
20  cps  bandwidth  and  a  signal-to-noise  ratio  somewhat  less  than 
that  of  a  conventional  telephone  circuit.   Generally,  the  pitch 
channel  requires  about  twice  the  bandwidth  of  the  spectrum 
channel. The channel vocoder can transmit highly intelligible
speech at an information rate of the order of 2000 bits per
second using 16 channels to span the telephone band (200-3200 cps).
However, the speech quality of the vocoder is poor because of pitch
errors and inadequate voiced-unvoiced detection.
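
One analyzer channel of such a vocoder (band-pass filter, rectifier,
and low-pass filter) can be sketched as follows. This is not Dudley's
circuit: the two-pole digital resonator, the one-pole smoother, the
8000 samples-per-second rate, and all parameter values other than the
20 cps smoothing bandwidth are assumptions chosen for illustration.

```python
import math

def channel_envelope(x, fs, center, bandwidth=190.0, smooth_cps=20.0):
    """Spectrum-channel signal: band-pass, full-wave rectify, low-pass."""
    r = math.exp(-math.pi * bandwidth / fs)          # pole radius sets the bandwidth
    c1 = 2.0 * r * math.cos(2.0 * math.pi * center / fs)
    c2 = -r * r
    a = math.exp(-2.0 * math.pi * smooth_cps / fs)   # one-pole low-pass coefficient
    y1 = y2 = env = 0.0
    out = []
    for s in x:
        y = (1.0 - r) * s + c1 * y1 + c2 * y2        # two-pole resonator (band-pass)
        y2, y1 = y1, y
        env = a * env + (1.0 - a) * abs(y)           # rectifier followed by smoothing
        out.append(env)
    return out

fs = 8000
tone = [math.sin(2.0 * math.pi * 1000.0 * i / fs) for i in range(4000)]
env_1000 = channel_envelope(tone, fs, center=1000.0)
env_2000 = channel_envelope(tone, fs, center=2000.0)
```

A 1000 cps tone produces a much larger slowly varying output in the
channel centered at 1000 cps than in the one centered at 2000 cps.
A bank of such channels gives the discrete short-time spectrum, and
each output needs only about the smoothing bandwidth for transmission.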

Efforts  to  improve  the  speech  quality  led  to  the  development 
of  a  split-band  vocoder  (Fig.  17) .   The  split-band  vocoder  trans- 
mits a  baseband  (the  lower  one- third  to  one-half  of  the  speech 
spectrum)  over  a  conventional  channel  (no  processing)  and  trans- 
mits the  upper  portion  of  the  spectrum  over  several  vocoder 
channels.   The  baseband  is  retarded  at  the  receiver  to  equalize 
the  delays  in  the  vocoder  channels  and  all  the  channels  are  re- 
combined.   Only  noise  excitation  is  used  for  the  high-band 
synthesis,  and  voiced-unvoiced  switching  is  performed  by  the  base- 
band signal.   The  addition  of  the  baseband  corrected  the  pitch 
errors  and  substantially  improved  the  speech  quality.   However, 
voiced-unvoiced  detection  still  remained  a  problem  (Flanagan, 
1959)  since  no  voiced  excitation  could  be  delivered  to  the  high 
band. 

Voice  Excited  Vocoders 

The speech spectrum reflects the nature of the vocal
excitation, and its description in vocoder transmission requires a
voiced-unvoiced decision based upon the amount of energy
contained in a low-frequency band which includes the pitch
frequency. The pitch frequency is determined by an average
zero-crossing count of the lowest frequency component of the speech
spectrum. The reliability of this decision and the accuracy of
the measurement of the pitch frequency depend critically upon
the input speech quality, and in particular on the signal-to-noise
ratio and low-frequency equalization.

A  voice-excited  vocoder  (Schroeder  et  al.,  1962)  avoids  such 
difficulties  by  generating  the  excitation  from  an  uncoded  base- 
band of  the  original  speech.   The  baseband  may  be  added  directly 
to  the  output,  but  its  main  function  is  to  provide  excitation  for 
the  synthesizer.   A  wide  band  excitation  is  generated  from  a 
narrow  baseband  by  nonlinear  distortion   (see  cross-correlation 
vocoder  spectrum  flattener,  Fig.  12) ,  which  produces  either  a 
flat  spectrum  of  noise  or  harmonic  frequency  components,  depend- 
ing on  its  input.   Thus,  the  excitation  is  reproduced  from  the 
original  speech  and  is  not  a  result  of  any  coding  procedure. 

This  method  is  quite  insensitive  to  input  conditions  and 
thus  avoids  the  pitch  problem.   The  voiced-unvoiced  decision  is 
also  bypassed  since  this  information  is  carried  explicitly  in 
the  baseband.   The  voice-excitation  also  removes  much  of  the 
electrical  accent  inherent  in  channel  vocoders.   However,  band- 
width is  sacrificed  for  the  baseband  transmission. 

A major technical problem in voice excitation is the
required spectrum flattening. All schemes use nonlinear
distortion to multiply the number of zero crossings, as described for
the cross-correlation vocoder. The characteristic of such a
device is similar to that shown in Fig. 12.

In order to achieve a reasonable bandwidth reduction factor,
the baseband must be as narrow as possible. The nominal minimum
for a wide range of speakers without incurring some degradation
is about 700 cps (Schroeder et al., 1962). A filter bank is
also needed in order to flatten such a band. The baseband is
first spread by rectification into a wider band, whose spectral
shape fluctuations are similar to those of the baseband. These
fluctuations are removed by narrow-band filtering and clipping.
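
The spreading of a narrow baseband by rectification can be
illustrated numerically. The sketch assumes full-wave rectification
(the text does not say which kind is used), a single 200 cps
component as the baseband, and an 8000 samples-per-second rate; the
helper name is hypothetical.

```python
import math

def tone_amplitude(x, fs, freq):
    """Amplitude of the component at `freq`, from a single-bin Fourier sum."""
    re = sum(v * math.cos(2.0 * math.pi * freq * i / fs) for i, v in enumerate(x))
    im = sum(v * math.sin(2.0 * math.pi * freq * i / fs) for i, v in enumerate(x))
    return 2.0 * math.hypot(re, im) / len(x)

fs, n = 8000, 8000
baseband = [math.sin(2.0 * math.pi * 200.0 * i / fs) for i in range(n)]
rectified = [abs(v) for v in baseband]               # full-wave rectifier (assumption)

before = [tone_amplitude(baseband, fs, f) for f in (400.0, 800.0, 1200.0)]
after = [tone_amplitude(rectified, fs, f) for f in (400.0, 800.0, 1200.0)]
```

Before rectification there is no energy at 400, 800, or 1200 cps;
afterwards components appear at these frequencies with decreasing
amplitude. This unequal spread is the spectral shape fluctuation
that the subsequent narrow-band filtering and clipping must remove.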

The  intelligibility  of  the  voice-excited  vocoder  does  not 
depend  on  whether  the  baseband  is  added  to  the  output  or  used 
only for excitation, because the voice-excitation mechanism
preserves  the  rapid  and  inherent  speech  pitch  fluctuations.   The 
channel  vocoder  pitch  circuit  removes  such  desired  fluctuations 
by  averaging.   Also,  the  conventional  excitation  is  either  a  quasi- 
periodic  waveform  or  noise,  while  the  voice-excited  vocoder  (VEV) 
has  a  mixture  (quasi-periodic  for  some  frequencies,  random  noise 
for  others)  which  can  be  appropriately  reproduced. 

Schroeder et al. (1962) found the VEV to be superior to the
channel  vocoder  in  the  quality  of  reproduced   speech,  but  inferior 
in  bandwidth  reduction  due  to  the  extra  bandwidth  required  for  the 
baseband.   Recent  work  by  these  authors  has  made  it  possible  to 
reduce  the  baseband  to  between  500  and  600  cps  without  appreciable 
degradation  of  the  speech.   Ten  to  twelve  vocoder  channels  were 
found  to  be  needed  for  satisfactory  operation.   The  bandwidth 
of  this  system  is  between  800  and  1000  cps  as  compared  to  approxi- 
mately 400  cps  for  the  channel  vocoder. 

The transmission bandwidth for a bandwidth compression
system depends on the type of modulation. Ordinarily,
single-sideband transmission should be used for the baseband, but the
requirement to preserve the dc component of the channel signals
prevents its use; therefore, Schroeder et al. (1962) proposed
quadrature modulation. In such a process, two signals are
amplitude-modulated (DSB) onto a single carrier frequency, $w_0$,
such that one of the signals is modulated onto $\cos w_0 t$ and the
other onto $\sin w_0 t$. Coherent carriers and a product detector
are required at the receiver. In the usual multiplex situation,
for  instance,  two  normal  voice-channels,  carrier  coherence  cannot 
be  assured  to  within  the  tolerance  necessary  for  holding  "cross- 
talk" between  channels  within  acceptable  bounds.   Crosstalk  is 
defined  as  mutual  overlapping  of  information  of  adjacent  channels. 
However,  vocoder  channel  signals  are  highly  correlated  and  because 
the  ear  is  not  too  sensitive  to  spectral  overlap  within  the  same 
speech  subbands,  the  crosstalk  between  adjacent  channels  needs 
to  be  held  to  within  about  20  db,  which  is  easily  realizable  by 
quadrature  modulation  systems. 
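
The dependence of crosstalk on carrier coherence can be sketched
numerically. The example assumes channel signals that are constant
over the averaging interval, an ideal product detector followed by
averaging over whole carrier cycles, and it reads the 20 db figure as
crosstalk 20 db below the wanted signal. The function names are
illustrative.

```python
import math

def detect(m1, m2, phase_error_deg, cycles=50, per_cycle=64):
    """Recover two quadrature-modulated signals with a carrier phase error."""
    phi = math.radians(phase_error_deg)
    n = cycles * per_cycle
    acc1 = acc2 = 0.0
    for i in range(n):
        wt = 2.0 * math.pi * i / per_cycle
        s = m1 * math.cos(wt) + m2 * math.sin(wt)    # transmitted quadrature signal
        acc1 += s * 2.0 * math.cos(wt + phi)         # product detector, cosine channel
        acc2 += s * 2.0 * math.sin(wt + phi)         # product detector, sine channel
    return acc1 / n, acc2 / n

def crosstalk_db(phase_error_deg):
    """Crosstalk relative to the wanted signal; undefined for zero phase error."""
    return 20.0 * math.log10(abs(math.tan(math.radians(phase_error_deg))))

ideal = detect(1.0, 0.5, 0.0)       # coherent carriers: (1.0, 0.5)
skewed = detect(1.0, 0.5, 5.71)     # about 5.7 degrees of carrier phase error
level = crosstalk_db(5.71)          # close to -20 db
```

Under these assumptions each output is the wanted signal times
cos(phi) plus the other signal times sin(phi), so a 20 db crosstalk
requirement tolerates a carrier phase error of roughly 5.7 degrees,
which is far looser than the coherence needed between two unrelated
voice channels.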

Resonance  Vocoders 

Experiments in the analysis and perception of speech show
that vowel sounds may be identified and synthesized from a
knowledge of the formant frequencies (Flanagan, 1956). A formant
extracting device for use in speech compression systems must
accept continuous speech at its input and produce output voltages
with time-varying amplitudes representing the formant frequencies.
A formant extracting device (Fig. 18) developed by Flanagan
(1956) divides the speech spectrum into formant frequency ranges.
The frequency with maximum spectrum amplitude within each
frequency range is then detected. This spectrum-segmentation method
is based upon the fact that the first three formants fall in
frequency ranges which, on the average, do not appreciably
overlap. An appropriate short-time spectrum of the input speech
signal is obtained using a set of analyzing filters. This set
is composed of 36 contiguous band-pass filters having a common
input but separate outputs. Each channel of the set includes a
tuned circuit, an amplifier, a full-wave rectifier, and a
smoothing network with a time constant of 10 milliseconds. The
center frequencies of the filter channels are set on a Koenig
frequency scale (logarithmic) extending from 150 cps to 7 kc.
The bandwidths of the filter channels are 100 cps for frequencies
below 1 kc and increase logarithmically from 100 cps at 1 kc to
450 cps at 7 kc. The adjacent channels overlap at the half-power
frequencies. The gain and bandwidth of each channel may be
adjusted independently. The useful dynamic range of each channel
is greater than 30 db (Flanagan, 1956).

The  speech  spectrum  slopes  downward  at  about  10  db/octave, 
on  the  average.   Thus,  it  is  desirable  to  perform  a  frequency 
equalization  that  permits  all  of  the  filter  channels  to  operate 
at  about  the  same  signal  level.   It  is  also  desirable  to  obtain 
a  spectral  output  in  which  all  the  maxima  are  approximately  of  the 
same  amplitude  in  order  to  alleviate  the  problem  of  dynamic 
range  in  the  formant-analyzing  equipment.   A  driver  amplifier 
employing  an  equalizing  network  is  used  to  supply  the  input  to 
the filter set. The frequency response of the equalizing
network rises at approximately 10 db/octave between 750 and 3000
cps, and is essentially flat outside this range. The network and
the  driver  amplifier  are  an  integral  part  of  the  filter  set.   Any 
reference  to  the  filter  set  in  the  following  discussion  assumes 
that  the  input  speech  is  equalized  according  to  the  frequency 
characteristic  mentioned  above. 

The outputs of the analyzing filter set are separated into
groups to cover the formant frequency ranges 0 ≤ F1 ≤ 800,
800 ≤ F2 ≤ 2280, and F3 ≥ 2280 cps, respectively. The outputs of
each  group  of  filter  channels  are  monitored  and  the  channel 
having  the  maximum  output  within  each  group  is  selected  and 
sampled  at  a  rate  of  60  times  per  second  to  indicate  the  formant 
frequency. 

A normalizing circuit computes the mean value of its set of
input voltages and subtracts this mean value from each of the
inputs. It provides one-half of this difference at each
corresponding output. For example, if $e_k$ is the voltage input to
the normalizing circuit from the kth filter channel of a group of
N channels, then the normalized kth channel voltage is

\[ e_k^{*} = \frac{1}{2}\left[ e_k - \frac{1}{N}\sum_{n=1}^{N} e_n \right] \tag{33} \]

Constraining the mean value of the set of voltages to zero
without altering the relative amplitudes permits reliable
selection of the maximum voltage over a range of mean amplitude
greater than 30 db.
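
Equation 33 amounts to a mean removal followed by a gain of one-half.
A minimal sketch, with an illustrative function name and made-up
voltages, shows that the ordering of the channels is unchanged while
the common level is removed:

```python
def normalize(voltages):
    """Normalized channel voltages: one-half of each input minus the group mean."""
    mean = sum(voltages) / len(voltages)
    return [0.5 * (e - mean) for e in voltages]

quiet = normalize([2.0, 4.0, 6.0])         # [-1.0, 0.0, 1.0]
loud = normalize([102.0, 104.0, 106.0])    # same result after a common offset
```

Both groups give the same normalized set, so the selector that follows
sees the same pattern regardless of the common level of the group.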

The normalized set of voltages of any one group is sent to
the appropriate grids of a thyratron maximum-amplitude selector.
The set of thyratron tubes has a common load resistor, and the
plate supply voltage is effectively turned on and off at a rate
of 60 times per second. The thyratron having the highest
positive grid voltage will fire first and preclude the firing of
any other tube in the presence of the plate voltage. A
potentiometer is connected as the cathode resistor of each thyratron
and  the  output  is  taken  from  the  arm  of  the  potentiometer.   The 
potentiometer  is  set  so  that  the  output  voltage  (when  the  tube 
fires)  is  proportional  to  the  frequency  of  the  channel  that  the 
tube  is  monitoring.   All  of  the  potentiometer  arms  are  connected 
to  a  resistance  adder  to  provide  a  single  output  from  the  selec- 
tor.  The  output  voltage  from  the  selector  is  a  series  of  rec- 
tangular pulses  whose  heights  correspond  to  the  frequency  of  the 
channel  selected  as  having  the  maximum  output. 
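
The logical behavior of the selector, apart from its circuit details,
is a maximum search repeated once per firing period. The sketch below
is an assumption-laden illustration: the channel frequencies, grid
voltages, and the scale factor relating output voltage to frequency
are all made up.

```python
def select_channel(grid_voltages):
    """Index of the channel with the highest voltage (the tube that fires first)."""
    return max(range(len(grid_voltages)), key=lambda k: grid_voltages[k])

def selector_output(frames, channel_freqs, volts_per_cps=0.001):
    """One rectangular pulse height per firing period, proportional to frequency."""
    return [volts_per_cps * channel_freqs[select_channel(f)] for f in frames]

freqs = [300.0, 500.0, 700.0]                      # hypothetical channel centers, cps
frames = [[0.1, 0.5, -0.6], [0.7, -0.2, -0.5]]     # normalized voltages per period
pulses = selector_output(frames, freqs)
```

Here the first period selects the 500 cps channel and the second the
300 cps channel, giving pulse heights of about 0.5 and 0.3 volt.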

The  clamper  is  a  filter  which  has  an  impulse  response  with 
Laplace  transform 

\[ \frac{1}{s}\left(1 - e^{-sT}\right) \tag{34} \]

where  T  is  the  selecting  or  firing  period  of  the  selector.   The 
selector  output  pulses  are  fed  into  the  clamper  for  smoothing. 
A  gate  pulse  is  generated  in  the  clamper  by  a  one-shot  multi- 
vibrator that  is  triggered  each  time  a  thyratron  fires.   The 
gate  pulse  samples  the  heights  of  the  successive  output  pulses 
from  the  selector.   This  is  the  output  of  the  formant  extracting 
system (Flanagan, 1956). This system is very stable, and its
calibration can be matched to essentially any single-valued
function relating formant frequency and output voltage.

The resonance or formant vocoder (Fig. 19) uses the formant
extractor described above as the vocoder analyzer. The
excitation data are handled essentially in the same manner as in the
channel vocoder. Two voltages are taken from the formant
analyzers. One is proportional to the amplitude ($A_n$) of the
spectral maximum, and the other to the frequency ($F_n$) of the
maximum. The frequency voltages tune formant resonators, and the
amplitude voltages modulate (see blocks M, Fig. 19) the inputs to the
resonators.

The  problems  that  plague  the  channel  vocoder  are  also  pre- 
sent in  the  resonance  vocoder.   Flanagan  (1959)  found  the  intel- 
ligibility of  the  resonance  vocoder  to  be  inferior  to  that  of  the 
channel  vocoder.   The  split-band  or  hybrid  idea  again  offered  a 
practical  compromise.   Flanagan  (1959)  developed  a  resonance 
vocoder  with  a  baseband  complement  (Fig.  20)  .   The  baseband 
covers  the  frequency  range  from  300  to  800  cps  and  is  trans- 
mitted without  further  processing.   Another  band  covers  the 
range  of  800  to  3200  cps  and  is  transmitted  by  a  resonance 
vocoder. Each amplitude and frequency signal is passed through
a low-pass filter with a bandwidth of 15 cps and an 18 db per
octave cut-off rate.

The  baseband  is  delayed  by  15  milliseconds  at  the  synthe- 
sizer to  equalize  its  delay  to  that  of  the  processed  band.   The 
excitation  is  derived  by  means  of  nonlinear  distortion  of  the 
baseband as discussed previously. The frequency response of the
resonance vocoder and baseband complement is shown in Fig. 21.
An improvement in the system can be made by widening the
baseband,  or  by  increasing  the  number  of  vocoder  channels  for 
the  higher  band,  but  this  results  in  a  reduction  of  bandwidth 
compression. 

The  Harmoniphone 

The  harmoniphone,  as  shown  in  Fig.  22  (Pirogov,  1959) , 
employs  harmonic  functions  for  the  coding  and  synthesis  of 
speech  information.   The  analyzer  at  the  transmitting  end  per- 
forms a  Fourier  analysis  of  the  speech  signal.   Any  analyzing 
filter  set  such  as  that  described  for  the  formant  extractor 
(Flanagan,  1956)  may  be  used.   The  spectrum  is  then  sampled  at 
a  rate  of  25  to  50  times  per  second,  depending  on  the  quality 
of  speech  required.   Each  sample  K' (w)  of  the  spectral  speech 
function  can  be  described  by  six  to  ten  discrete  levels,  on  the 
average,  although  a  good  reproduction  of  the  vowels  can  be  made 
with  spectra  with  envelopes  defined  by  three  to  five  discrete 
levels only. This means that the form of a spectral function
can be transmitted in a frequency band limited to the third to
fifth harmonics of the sampling frequency:

\[ BW = n F_s = (3 \text{ to } 5)(25 \text{ to } 50) = 75 \text{ to } 250 \text{ cps} \tag{35} \]

The same channel must also transmit pitch information, which
requires another band of 50 cps. A harmoniphone telephone channel
can thus be transmitted in a bandwidth of 100 to 300 cps, depending
on the required quality of speech.

Fig. 21. Frequency response of resonance vocoder and baseband
system (Flanagan, 1959).

The analyzed signal, S(t), is applied to the synchronous
detectors SD1a, SD2a, ..., SD1b, SD2b, ... and to the filter of the
constant component $F_0$ (see Fig. 22). The synchronous detectors
are controlled by the sampling frequency $F_s$ and its harmonics,
which are obtained from the synchronizer equipment in orthogonal
phase relations (sin wt, cos wt, sin 2wt, cos 2wt, etc.,
where $w = 2\pi F_s$). A synchronizer is generally used for many
vocoders.

The synchronous detectors work into integrating circuits
with outputs of

\[ a_k = \frac{2}{T}\int_0^T S(t)\cos kwt \, dt \tag{36} \]

and

\[ b_k = \frac{2}{T}\int_0^T S(t)\sin kwt \, dt \tag{37} \]

which are proportional to the coefficients of the Fourier series
of S(t). When the coefficients $a_k$ and $b_k$ are known, it is
possible to synthesize a four-terminal network whose frequency
characteristic will, with an accuracy up to the highest harmonic of the
Fourier analysis of S(t), correspond to the envelope of the
short-time spectrum K'(w), because the coefficients fully
determine the shape of the frequency characteristic K(w).
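
Equations 36 and 37 describe a multiply-and-integrate operation that
is simple to imitate numerically. The sketch assumes that the
analyzed signal is available as a function of time over one sampling
period T and replaces the integrating circuits by a rectangle-rule
sum; the function name and the test signal are illustrative.

```python
import math

def fourier_coefficients(signal, period, harmonics, steps=2000):
    """Return lists (a, b) of the coefficients a_k and b_k for k = 1..harmonics."""
    w = 2.0 * math.pi / period
    dt = period / steps
    a, b = [], []
    for k in range(1, harmonics + 1):
        # Synchronous detection: multiply by cos kwt or sin kwt, then integrate.
        a_k = sum(signal(i * dt) * math.cos(k * w * i * dt) for i in range(steps))
        b_k = sum(signal(i * dt) * math.sin(k * w * i * dt) for i in range(steps))
        a.append(2.0 / period * a_k * dt)
        b.append(2.0 / period * b_k * dt)
    return a, b

T = 1.0 / 25.0                                   # sampling rate of 25 per second
w0 = 2.0 * math.pi / T
s = lambda t: 3.0 * math.cos(w0 * t) + 2.0 * math.sin(2.0 * w0 * t)
a, b = fourier_coefficients(s, T, harmonics=3)
```

For this test signal the detectors return a_1 = 3 and b_2 = 2, with all
other coefficients essentially zero, which is the set of numbers the
harmoniphone would transmit for that sampling period.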

A  Chebyshev-Type  Vocoder 

A speech bandwidth compression system based on the
transmission of a limited number of signal parameters proportional to
the coefficients of expansion of the instantaneous pulse response
of the synthesizer in a series of weighted Laguerre polynomials
was developed by Kulya (1963). The system, similar to the
harmoniphone  system,  belongs  to  the  class  of  orthogonal  systems 
in  which  the  transfer  constants  and  the  pulse  response  of  the 
synthesizer  are  expressed  as  orthogonal  functions.   As  a  result, 
increasing  the  number  of  terms  of  the  approximating  series  makes 
it  possible  to  achieve  a  transmission  of  the  approximated 
spectral  function  as  accurately  as  desired.   Furthermore,  it  is 
possible  to  retain  any  small  number  of  the  signal  parameters 
without  modifying  any  of  the  apparatus. 

The  application  of  Laguerre  functions  in  an  orthogonal 
vocoder  leads  to  two  desirable  qualities.   First,  there  is  an 
improvement  in  the  synthesized  signal  quality  for  a  limited  number 
of  signal  parameters  due  to  the  practical  acceptance  of  nonuni- 
form approximation  to  the  frequency  scale  in  the  relationship 
between  the  instantaneous  spectral  intensity  of  the  speech  signal 
and  frequency.   The  second  merit  of  the  system  is  the  simplifica- 
tion of  circuit  solutions,  and  the  reduction  in  size,  weight, 
and  cost  of  the  apparatus  due  to  the  exclusion  of  low-frequency 
inductance  coils. 

If the envelope of the modulus of the speech signal
spectrum is of the form S(w,t), then the pulse response of the
synthesizer can be represented as

\[ g(\tau,t) = \frac{1}{\pi}\int_0^{\infty} S(w,t)\cos w\tau \, dw \tag{38} \]

and the signal parameters, which are proportional to the
coefficients of its expansion into a series of Laguerre functions
on the right semiaxis of time $\tau$, are

\[ a_n(t) = \int_0^{\infty} g(\tau,t)\,L_n(\tau)\,d\tau = \int_0^{\infty} S(w,t)\,\psi_n(w)\,dw, \quad \text{for } \tau > 0 \tag{39} \]

where $L_n(\tau)$ are the orthogonal Laguerre functions of nth order
for integral values of n.

The  cosine  transform  of  L  (T)  is 

n ___ 

r°°  T2n  +  1  (     /l  +  4w2A2) 

0/  n(w)  =  I  l(T)  cos  wr  dr  =  2n  li L^= (41) 

where $T_{2n+1}(x)$ are the Chebyshev polynomials of the first kind
and (2n+1)th order. Equation 39 can be approximated by the
finite sum

\[ a_n(t) = \sum_{k=1}^{m} S(w_k,t)\,\psi_n(w_k)\,\Delta w_k \tag{42} \]

where $S(w_k,t)$ are the readings of the envelope of the
instantaneous spectrum taken along the frequency axis for the values
$w = w_k$, m is the number of readings, and
$\Delta w_k = w_{k+1} - w_k$.


The signal parameters $a_n(t)$, Equation 42, are obtained from
the speech signal with the aid of the analyzer shown in Fig. 23.
The readings $S(w_k,t)$ of the approximate amplitude of the envelope
of the instantaneous speech spectrum are obtained at the outputs
of the 17 spectral channel filters $B\Phi_1$ to $B\Phi_{17}$ and
amplifier-rectifiers $B_1$ to $B_{17}$. Since the functions
$\psi_n(w)$ are of dual polarity, the amplifier-rectifiers should
provide both positive and negative output voltages. The output
voltages of each of the spectral channels are summed at each of the
eight outputs $a_0(t), a_1(t), \ldots, a_7(t)$ with coefficients
proportional to the readings of the corresponding functions
$\psi_n(w)$ at the points $w = w_k$. This conversion is achieved by
means of a resistive matrix reader (Kulya, 1963).
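
The resistive matrix performs the weighted sums of Equation 42. A
minimal sketch follows; the weight values stand in for the readings
of the weighting functions at the channel frequencies and are made up
for illustration, as are the function and variable names.

```python
def signal_parameters(readings, weights, delta_w):
    """Weighted sums of the channel readings, one per signal parameter.

    readings : channel outputs S(w_k, t), one per spectral channel
    weights  : weights[n][k], the nth weighting function read at w_k
    delta_w  : frequency spacing associated with each reading
    """
    return [
        sum(s * c * dw for s, c, dw in zip(readings, row, delta_w))
        for row in weights
    ]

readings = [1.0, 2.0, 3.0]                 # three channels instead of seventeen
weights = [[1.0, 1.0, 1.0],                # hypothetical weights for the first parameter
           [1.0, 0.0, -1.0]]               # dual-polarity weights for the second
params = signal_parameters(readings, weights, [0.5, 0.5, 0.5])
```

The negative entry in the second row shows why channel outputs of both
polarities are needed: a purely resistive network can only add, so
each reading must be available with either sign.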

The signal parameters obtained are smoothed by low-frequency
filters $\Phi_0$ to $\Phi_7$ with a pass-band of 25 cps. The
circuit  and  typical  frequency  characteristic  of  a  filter  are 
shown  in  Fig.  24. 

The instantaneous pulse response of the synthesizer, as a
Fourier transformation of S(w,t), should be of the form

\[ g(t,|\tau|) = \sum_{n=0}^{7} a_n(t)\,L_n(|\tau|) \tag{43} \]

A response of such a form is not physically realizable since it
extends throughout the entire time axis, $-\infty \le \tau \le \infty$.
The synthesizer pulse response may be expressed in the following form
instead of as in Equation 43:

7 
g(t,r>         [an(t>     t     n(r)    +Lm+n+1(r)]  (44) 

n=0      L 
for $\tau \ge 0$, which is physically realizable. The corresponding
circuit is shown in Fig. 25. The response of the cascaded "a"
and the n "b" sections is in the form of a Laguerre function of
nth order.

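To determine the optimum number of terms to be retained in
Equation 43, the following relationship can be developed:

$$S(\omega, t) = \frac{e^{-2j\arctan(2\omega/\lambda)}}{\sqrt{1 + 4\omega^2/\lambda^2}} \sum_{n=0}^{7} a_n(t)\, T_{2n+1}\!\left(\frac{1}{\sqrt{1 + 4\omega^2/\lambda^2}}\right) = \frac{e^{-2j\arctan(2\omega/\lambda)}}{\sqrt{1 + 4\omega^2/\lambda^2}} \sum_{n=0}^{7} a_n(t)\, \Phi_n(\omega/\lambda) \tag{45}$$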

Each term under the summation sign in Equation 45 is an oscillating
function with a number of zeros on the $\omega$-axis equal to the
order n of the function. The zeros are not equally spaced and,
for the finite expression, the accuracy of approximation of the
function $S(\omega, t)$ decreases with an increase in frequency
$\omega$. A measure of this change of accuracy may be obtained from
the relationship

$$\theta = \arctan\frac{2\omega}{\lambda} \tag{46}$$

where $\theta$ is in degrees.

Kulya found the optimum value of $\lambda$ to be
$\lambda = 5.34\pi \times 10^3$ and the required bandwidth to be
proportional to the frequency. The bandwidth required for
transmission is 175 cps plus the guard space. Assuming guard spaces
of 10 cps, amounting to 80 cps in all, the total bandwidth would be
255 cps, giving a compression ratio of 12:1 with an articulation
efficiency of approximately 85%.
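
The accuracy measure of Equation 46 and the bandwidth arithmetic above can be checked with a short sketch. Several points are assumptions rather than statements from the source: that $\omega = 2\pi f$ with f in cps, that the guard space is made up of eight 10-cps spaces (inferred from the 255-cps total), and that the uncompressed speech band is 3000 cps. The function names are hypothetical.

```python
import math

LAMBDA = 5.34 * math.pi * 1e3   # value of lambda quoted in the text


def theta_degrees(freq_cps, lam=LAMBDA):
    """Equation 46, assuming omega = 2*pi*f (an assumption about units)."""
    omega = 2.0 * math.pi * freq_cps
    return math.degrees(math.atan(2.0 * omega / lam))


def total_bandwidth(parameter_bw_cps, num_guard_spaces, guard_cps):
    """Parameter bandwidth plus the guard spaces between channels."""
    return parameter_bw_cps + num_guard_spaces * guard_cps


# theta grows toward 90 degrees as frequency rises, which is the sense in
# which the finite expansion becomes less accurate at high frequencies.
thetas = [theta_degrees(f) for f in (250, 1000, 3000)]

total = total_bandwidth(175, 8, 10)   # eight guard spaces: inferred, not stated
ratio = 3000 / total                  # 3000-cps speech band: assumed
```

Under these assumptions the total is 255 cps and the ratio rounds to 12:1, in agreement with the figures quoted.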




DISCRETE  SOUND  ANALYSIS-SYNTHESIS  METHODS 

Phonetic  Pattern  Recognition  Vocoder 

The phonetic pattern recognition vocoder (Dudley, 1958), shown in
Fig. 26, compares observed phonetic patterns with stored standard
patterns and transmits sufficient information to synthesize the
recognized pattern. The speech is filtered by a set of ten
band-pass filters, each 300 cps wide except the first, which is
250 cps wide. Each band-pass circuit contains an amplifier, a
rectifier, and a low-pass filter to smooth the speech power to a
syllabic rate. An amplifier gain adjustment is used as a
convenient means of frequency equalization. The output of each
band-pass circuit is fed to the memory circuit, which consists
of a 10 x 10 matrix of potentiometers. These are set to yield
the spectral frequency pattern for the sound i, as in seat, in
the first row, I, as in sit, in the second row, and so on to the
tenth row for s. The potentiometer settings are determined from
measurements of the band-pass circuit outputs as the sounds are
sustained. A relay (not shown), set to operate slightly above
the noise level in each band, feeds a capacitor through a high
resistance. As the reference sounds are spoken (sustained) by
the speaker, the charges on the capacitors build up according
to the spectral pattern. At the conclusion of the sound the
relays open, holding the capacitor voltages for measurement.
The voltage data thus obtained are normalized so that the smallest
voltage is reduced to zero. The voltage settings for the matrix
are listed in Table IV. The entry for row p and column b is the
normalized voltage from the matrix potentiometer output in the
pth row of the bth column (band-pass circuit), measured as the
sound prototype for the pth phonetic pattern is sustained long
enough to give a set of reasonably large voltages. The ratios of
these values to 150 v are the matrix transfer ratios.

The output voltage $v_{ps}$ from the matrix for the pth phonetic
pattern branch at any instant as the sth speech sound is spoken
is given by the summation

$$v_{ps} = \sum_{b=1}^{10} r_{pb}\, v_{sb} \tag{47}$$

where $v_{sb}$ is the instantaneous smoothed rectified voltage
output of the bth band-pass circuit as the sth sound is spoken, and
$r_{pb}$ is the voltage transfer ratio for the potentiometer setting
of the bth band-pass circuit for the pth phonetic pattern as
given in Table IV (after normalization). At any instant, for
one value of p, $v_{ps}$ will have a maximum value,

$$v_{Ps} = \text{largest } v_{ps}, \quad \text{for } p = P. \tag{48}$$

The Pth phonetic pattern would then be the pattern assigned by
the apparatus to that portion of the sth sound. In general,
memory pattern P would have the best spectral match at that
instant with the portion of the sth sound being spoken.
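
Equations 47 and 48 amount to a matrix-vector product followed by the selection of the largest element. The sketch below illustrates this with a hypothetical 3 x 3 matrix instead of the 10 x 10 matrix of Table IV, whose values are not reproduced here; the function names and all numerical values are assumptions for illustration only.

```python
def branch_voltages(transfer_ratios, band_voltages):
    """Equation 47: v_ps = sum over b of r_pb * v_sb, for every pattern p."""
    return [
        sum(r * v for r, v in zip(row, band_voltages))
        for row in transfer_ratios
    ]


def best_pattern(transfer_ratios, band_voltages):
    """Equation 48: index P of the branch with the largest voltage."""
    v = branch_voltages(transfer_ratios, band_voltages)
    return max(range(len(v)), key=lambda p: v[p])


# Hypothetical transfer ratios: one row per stored pattern, one column per band.
ratios = [
    [1.0, 0.2, 0.0],
    [0.1, 1.0, 0.3],
    [0.0, 0.2, 1.0],
]
observed = [0.2, 0.9, 0.4]   # smoothed, rectified band outputs at one instant

voltages = branch_voltages(ratios, observed)
chosen = best_pattern(ratios, observed)
```

Here the second stored pattern (index 1) gives the largest branch voltage, since the observed energy is concentrated in the second band.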

The actual selection of the phonetic pattern giving the best
achievable instantaneous match takes place to the right of the
resistance matrix in Fig. 26. The ten voltages $v_{ps}$ from the
phonetic pattern matrix are amplified by the individual buffer
amplifiers, whose outputs are connected through biasing diodes
D and transformer winding T to ground via the 10 kc series
resonant circuit. A 10 kc oscillator supplies the sensing
voltage through capacitors for all of the matrix outputs.
A voltage-biasing circuit attached to the 10 kc resonant
circuit provides a minimum threshold to prevent operation by
noise. When any branch voltage $v_{ps}$ becomes large enough to
overcome this bias, it passes dc through its own branch
diodes D and lowers their resistance so that 10 kc current flows
through the transformer winding T. The dc current adds bias
in the same direction as the original bias from the noise-bias
battery E, increasing the bias against current from any of the
other branches. Thus the selection is completed, with the 10 kc
current transmitted only in the branch having the strongest
signal. The resulting 10 kc pattern recognition current is
rectified by diode D' and then, after smoothing, is fed to the
synthesizer control circuit (Dudley, 1958).
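
The behavior of the selection circuit can be imitated in software as a thresholded "strongest branch wins" rule. The sketch below is a behavioral analogy only and makes no attempt to model the diodes or the 10 kc sensing current; `select_branch` and its arguments are hypothetical names.

```python
def select_branch(branch_voltages, noise_bias):
    """Return the index of the single conducting branch, or None.

    A branch conducts only if it exceeds the bias. Once it conducts, it
    raises the bias seen by every other branch to its own level, so only
    a still stronger branch can take over.
    """
    bias = noise_bias
    winner = None
    for p, v in enumerate(branch_voltages):
        if v > bias:
            bias = v
            winner = p
    return winner


strong = select_branch([0.38, 1.04, 0.58], noise_bias=0.5)
silent = select_branch([0.10, 0.20, 0.05], noise_bias=0.5)
```

When every branch voltage lies below the noise bias no branch is selected, which corresponds to the threshold that prevents operation by noise.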

A  phonetic  pattern  recognition  vocoder  is  shown  in  Fig.  27. 
The  direct  speech  is  recorded  through  a  3  kc  low-pass  filter  for 
later  comparison  with  the  speech  produced  by  the  vocoder.   The 
analyzer  has  devices  for  both  pitch  and  spectrum  determination. 
The  pitch  circuit  is  the  same  as  that  used  in  the  channel 
vocoder.   The  spectrum  analyzer  is  the  phonetic  pattern  recog- 
nizer described  above.   The  analyzer  recognizes  only  ten 
patterns  corresponding  to  four  consonants  and  vowels. 

A set of ten resistors for each pattern recognized controls
a 10-channel vocoder synthesizer at the receiver. The channel
resistors are chosen to provide the proper amounts of current
for each pattern recognized. Each set of resistances is
followed by a 25 cps low-pass filter so that the current to the
synthesizing modulators passes smoothly from one value to
another. The fixed voltage for each pattern recognized
distributes the energy source through the ten resistors in the set
for that pattern in such a way that the appropriate spectrum is
produced in the vocoder output.
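
The smoothing of the channel currents as the recognized pattern changes may be illustrated as follows. This is a sketch under assumptions: a one-pole recursive filter stands in for the 25 cps low-pass filter, whose actual form is not given here, the control signal is assumed to be sampled at 1000 samples per second, and the table of channel currents is hypothetical (three channels are shown instead of ten for brevity).

```python
import math


def smooth_channel_currents(pattern_sequence, channel_table,
                            sample_rate=1000.0, cutoff_cps=25.0):
    """Track the per-pattern channel currents through a one-pole low-pass.

    pattern_sequence: recognized pattern index at each sample instant.
    channel_table:    channel_table[p][c] is the target current of
                      channel c while pattern p is recognized.
    """
    alpha = 1.0 - math.exp(-2.0 * math.pi * cutoff_cps / sample_rate)
    state = [0.0] * len(channel_table[0])
    history = []
    for p in pattern_sequence:
        target = channel_table[p]
        state = [s + alpha * (t - s) for s, t in zip(state, target)]
        history.append(list(state))
    return history


# Hypothetical target currents for two patterns and three channels.
table = [
    [1.0, 0.3, 0.0],
    [0.1, 0.8, 0.6],
]
sequence = [0] * 200 + [1] * 200   # pattern 0, then a switch to pattern 1
currents = smooth_channel_currents(sequence, table)
```

Immediately after the switch the currents lie between the two sets of target values, and they settle to the new set within a few tens of milliseconds.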

If the recognition process could be made to approach the
human facility, the number of phonetic patterns would be roughly
comparable to the number of alphabetic characters used in
telegraphy. In other words, if only 32 characters need be
transmitted to give the 26 letters of the alphabet, then only 32
characters are required for the spoken word as well. That is,
five bits of information per character (or at most six bits, for
64 characters) need to be transmitted, provided the information
is limited to the sort produced in the written word.

The phonetic pattern recognition vocoder has several
shortcomings. The chief ones are the limited number of patterns
available and the lack of adequate provision for recognizing
patterns in which the change of power with time is an essential
characteristic, as in "plosives" (t, b, p).

If the pattern recognition were perfect, the required
bandwidth for 32 characters would be

$$B = \tfrac{1}{2}\, I\, C\, W = \tfrac{1}{2} \times 5 \times 5 \times 4 = 50 \text{ cps} \tag{49}$$

where I is the number of "Nyquist" time intervals per character,
C is the average number of characters per word, and W is the
average number of words per second; here I = 5, C = 5 characters
per word, and W = 4 words per second.

However, the recognition is not perfect, and the required
bandwidth for a 16-character alphabet was found to be
approximately 100 cps (Dudley, 1958), giving a bandwidth
reduction factor of 30:1.
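
The arithmetic of Equation 49 and of the quoted reduction factor can be verified directly. In the sketch below the 3000-cps figure for the uncompressed speech band is an assumption (it is the value that makes 100 cps correspond to 30:1), and the function names are hypothetical.

```python
def ideal_bandwidth_cps(intervals_per_char, chars_per_word, words_per_sec):
    """Equation 49: B = (1/2) * I * C * W."""
    return 0.5 * intervals_per_char * chars_per_word * words_per_sec


def reduction_factor(original_bw_cps, compressed_bw_cps):
    """Ratio of the original bandwidth to the compressed bandwidth."""
    return original_bw_cps / compressed_bw_cps


ideal = ideal_bandwidth_cps(5, 5, 4)       # values of I, C and W from the text
measured = reduction_factor(3000, 100)     # 3000 cps: assumed speech band
best_case = reduction_factor(3000, ideal)  # limit for perfect recognition
```

On the same assumption, perfect recognition would correspond to a reduction factor of 60:1, twice the value obtained in practice.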


SOUND  GROUP  ANALYSIS-SYNTHESIS  METHODS

A  system  which  automatically  recognizes  entire  words,  trans- 
mits a  unique  code  for  each  word,  and  at  the  receiver  converts 
the  word  code  to  synthetic  speech  may  offer  a  means  of  trans- 
mission at  extremely  low  information  rates  for  a  limited  vocab- 
ulary.  For  example,  a  vocabulary  of  32  words  can  be  transmitted 
at  information  rates  of  10  bits  per  second.   A  literature  search 
shows  that  no  attempt  has  been  made  to  produce  a  complete  sys- 
tem, but  both  analyzers  and  synthesizers  have  been  investigated 
separately. 
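
The information rate quoted for a 32-word vocabulary follows from the number of bits needed to identify one word and the rate at which words are sent. The sketch below assumes a rate of two words per second, which is not stated in this paragraph but reproduces the 10 bits per second figure; `information_rate_bps` is a hypothetical name.

```python
import math


def information_rate_bps(vocabulary_size, words_per_second):
    """Bits per second needed to send one fixed-length code per word."""
    bits_per_word = math.ceil(math.log2(vocabulary_size))
    return bits_per_word * words_per_second


rate = information_rate_bps(32, 2)   # two words per second is an assumption
```

A fixed-length code is assumed here; a vocabulary of 32 words requires five bits per word.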

Work  was  done  on  an  analyzer  (Kock,  1956  and  Dudley,  1958) 
which  would  automatically  recognize  spoken  digits.   The  device 
analyzes  the  speech  input  to  determine  which  sound  in  its  memory 
is  most  similar  to  the  observed  sound.   First  it  breaks  the 
spoken  digit  into  a  series  of  sound  identifications  and  then  by 
comparison  determines  which  of  the  ten  digits  in  its  memory  has 
the  same  sequence.   It  can  recognize  the  digits  as  spoken  by  the 
voice  for  which  the  system  is  calibrated,  but  fails  to  perform 
satisfactorily  for  any  other  voice,  or  even  a  change  in  the 
manner  of  speaking  for  the  calibrating  voice. 
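
The two-stage procedure described above, sound identification followed by comparison of the resulting sequence with stored sequences, can be sketched as follows. The sound labels and the stored sequences are hypothetical and are not taken from Kock or Dudley; exact matching is used on purpose, since it shows why a change of speaker or manner of speaking defeats such a device.

```python
from itertools import groupby

# Hypothetical stored sound sequences for a few digits.
DIGIT_SEQUENCES = {
    "one": ("w", "uh", "n"),
    "two": ("t", "oo"),
    "six": ("s", "ih", "k", "s"),
}


def collapse(frame_labels):
    """Reduce runs of identical per-frame labels to a single label each."""
    return tuple(label for label, _ in groupby(frame_labels))


def recognize_digit(frame_labels, memory=DIGIT_SEQUENCES):
    """Return the stored digit whose sequence matches exactly, else None."""
    observed = collapse(frame_labels)
    for digit, stored in memory.items():
        if stored == observed:
            return digit
    return None


hit = recognize_digit(["s", "s", "ih", "ih", "ih", "k", "s", "s"])
miss = recognize_digit(["s", "eh", "k", "s"])   # a different vowel quality
```

A single misidentified sound, as in the second call, is enough to prevent recognition.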

Magnetic tape playbacks operated from a source of digital
information provide a satisfactory device for the synthesis of
complete words. The Automatic Voice Readout System (AVRS)
(Poppe and Suhr, 1957), shown in Fig. 28, is an example of such a
device. The AVRS is intended for use as a digital-code-to-voice
converter that reads out commands from a digital computer vocally.
The five-digit control signals required to control a relay pyramid
are supplied from a coding unit and synchronizer. The output of
the relay pyramid drives an audio amplifier. The synchronizer
provides the basic timing within the system. Synchronization
pulses are received every one-half second and are used to
advance the synchronizer to the next word interval. The
function of the coding unit is to convert the input code to the
five-digit parallel code required to drive the relay pyramid.
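
The action of a relay pyramid driven by a five-digit parallel code can be pictured as a binary tree in which each digit halves the set of candidate playback tracks. The sketch below assumes that arrangement and assumes the most significant digit acts first; neither detail is given in the description above, and the track names are hypothetical.

```python
def relay_pyramid_select(bits, tracks):
    """Select one track from 2**len(bits) tracks, one relay stage per bit."""
    if len(tracks) != 2 ** len(bits):
        raise ValueError("number of tracks must equal 2 ** number of bits")
    low, high = 0, len(tracks)
    for bit in bits:
        middle = (low + high) // 2
        if bit:
            low = middle    # relay operated: keep the upper half
        else:
            high = middle   # relay released: keep the lower half
    return tracks[low]


# Hypothetical 32-word vocabulary, one tape track per word.
vocabulary = ["word%d" % i for i in range(32)]
selected = relay_pyramid_select((1, 0, 1, 0, 1), vocabulary)
```

Five stages suffice for 32 tracks, which is consistent with the five-digit code mentioned in the text.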

A  system  consisting  of  a  limited  vocabulary  coding  analyzer 
and  a  limited  vocabulary  decoding  synthesizer  such  as  the  AVRS 
would  constitute  an  operable  voice  communication  system  of  very 
low  information  rate.   Although  such  a  system  has  a  very  limited 
vocabulary,  there  are  certain  situations  such  as  aircraft 
traffic  control,  where  such  a  limited  vocabulary  would  be  en- 
tirely satisfactory.   A  diagram  of  such  a  system  is  shown  in 
Fig.  29. 
CONCLUSIONS

The problem of speech bandwidth compression centers on two
questions: what essential information contained in the speech
spectrum must be transmitted in order that intelligible speech
may be reproduced, and what is the most reliable means of
transmitting it. Bandwidth compression techniques may be grouped
into four general categories, namely, time and frequency
compression methods, continuous analysis-synthesis methods,
discrete sound analysis-synthesis methods, and sound group
analysis-synthesis methods. All of these techniques attempt to
eliminate the non-essential information before transmission.

The envelope of the power spectrum carries the essential
information, and it is therefore desired to reconstruct the
power spectrum envelope as nearly as possible from the limited
transmitted information produced by the analyzer of such a
system. Some techniques, such as the Vobanc (2:1) and Codimex
(4:1) systems, produce good quality speech but exhibit very
modest bandwidth reduction. Other techniques, such as phonetic
pattern recognition vocoders, channel vocoders, and resonance
vocoders, achieve high bandwidth reduction factors of
approximately 30:1, 20:1, and 10:1 respectively, but are lacking
in the quality of the reproduced speech. This loss of quality is
caused by errors in the pitch extraction and in the
voiced-unvoiced decisions. Also, since the synthesizers can
produce only a continuous spectrum or a discrete spectrum, no
semi-vowels can be produced by systems employing such excitation.
This difficulty can be overcome, to a great extent, by
transmitting a baseband of the original speech and deriving the
excitation from the baseband. Such a device is called a
voice-excited vocoder. Such vocoders reproduce higher quality
speech than their counterparts, but the bandwidth reduction
(4:1 to 5:1) suffers because of the bandwidth required for the
transmission of the baseband. In general, the quality of the
speech has little effect on its intelligibility, but any attempt
to improve the quality results in less bandwidth reduction. The
autocorrelation vocoder seems to be the only system that
produces both reasonable quality and reasonable bandwidth
reduction (10:1). This vocoder also experiences the pitch and
excitation difficulties, but performs creditably in spite of them.

Discrete  sound  analysis-synthesis  methods  and  sound  group 
analysis-synthesis  methods  are  capable  of  very  low  information 
rates  of  transmission  of  60  bits  per  second  and  5  bits  per  second, 
respectively,  but  have  several  serious  drawbacks.   First  of  all, 
in  order  to  use  such  systems  in  general  speech  communication 
links,  a  very  large  vocabulary  would  be  required.   Thus,  situa- 
tions where  limited  vocabularies  are  used  are  the  more  likely 
possibilities  for  these  systems. 

These  constitute  the  major  bandwidth  compression  techniques. 
ABSTRACT

Application of speech bandwidth compression techniques to
voice communications promises more efficient use of the radio
spectrum and improved performance of noisy, long-distance
communication links. Proof of the need for bandwidth
compression can be found in the fact that conventional speech
transmission requires a transmission rate of approximately 24,000
bits per second, while the same information transmitted by
teletype requires a rate of only 75 bits per second. This need for
conservation of communication space brought about the
development of the various techniques of bandwidth compression
which are the subject of this report.

There is a considerable difference between the information
rate necessary to communicate the speech signal in the
conventional manner and the actual rate at which information
appears to be generated by the vocal mechanism. The additional
information rate is accounted for by the identity of the speaker
and his emotional status, by redundancy, and by inefficient use
of the spectrum. In general, all speech bandwidth compression
systems attempt to exploit one or more of these factors to obtain
a reduction in the bandwidth and thus in the required channel
capacity.

The  bandwidth  compression  techniques  may  be  grouped  into 
four  general  categories: 
1)  Time and frequency compression methods.

Such methods exploit the redundancy or regularities existing
in the speech signal by sampling and frequency division
techniques. These systems generally exhibit bandwidth compression
on the order of 2:1 to 10:1 and can be transmitted in binary code
form over channels of 2,000 to 10,000 bits per second capacity.

2)  Continuous  analysis-synthesis  methods. 

In  place  of  the  speech  signal  spectrum  such  methods  trans- 
mit a  description  of  the  spectrum  in  terms  of  a  number  of  analog 
parametric  control  signals.   As  such,  they  exploit  both  the 
redundancy  and  inefficiency  existing  in  the  speech  signal.   In 
general,  these  systems  exhibit  bandwidth  compression  in  the 
order  of  10:1  to  20:1  with  a  required  channel  capacity  of  1,000 
to  2,000  bits  per  second. 

3)  Discrete  sound  analysis-synthesis  methods. 

Such methods transmit, in place of the speech signal, code
groups which identify the fundamental sounds that constitute the
speech. As such, they exploit the redundancy and inefficiency
of the speech signal, and remove the identity and emotional
status cues. It is expected that such systems should be capable
of transmitting speech at information rates as low as 60 bits
per second.

4)  Sound  group  analysis-synthesis  methods. 

Such  methods  transmit  only  certain  groups  of  sounds  (par- 
ticular words  and  phrases) ,  each  identified  by  a  code  group. 
Information  rates  in  this  case  are  a  function  of  the  size  of 
the  vocabulary.   Such  a  system  appears  to  be  most  useful  at 
information  rates  in  the  order  of  5  to  10  bits  per  second. 

Although some of the methods achieve very high compression
ratios, their articulation efficiency, in some cases, may be
too low for practical use. In general, bandwidth compression
must be sacrificed to improve the articulation efficiency.


